iPhylo: SPARQL

Roderic D. M. Page


Monday, December 20, 2021

GraphQL for WikiData (WikiCite)

I've released a very crude GraphQL endpoint for WikiData. More precisely, the endpoint is for a subset of the entities that are of interest to WikiCite, such as scholarly articles, people, and journals. There is a crude demo at https://wikicite-graphql.herokuapp.com. The endpoint itself is at https://wikicite-graphql.herokuapp.com/gql.php. There are various ways to interact with the endpoint; personally I like the Altair GraphQL Client by Samuel Imolorhe.

As I've mentioned earlier, it's taken me a while to see the point of GraphQL. But it is clear it is gaining traction in the biodiversity world (see for example the GBIF Hosted Portals) so it's worth exploring. My take on GraphQL is that it is a way to create a self-describing API that someone developing a web site can use without having to bury themselves in the gory details of how data is internally modelled. For example, WikiData's query interface uses SPARQL, a powerful language that has a steep learning curve (in part because of the administrative overhead brought by RDF namespaces, etc.). In my previous SPARQL-based projects such as Ozymandias and ALEC I have either returned SPARQL results directly (Ozymandias) or formatted SPARQL results as schema.org DataFeeds (equivalent to RSS feeds) (ALEC). Both approaches work, but they are project-specific, and if anyone else tried to build on these projects they might struggle to figure out what was going on. I certainly struggle, and I wrote them!

So it seems worthwhile to explore this approach a little further and see if I can develop a GraphQL interface that can be used to build the sort of rich apps that I want to see. The demo I've created uses SPARQL under the hood to provide responses to the GraphQL queries. So in this sense it's not replacing SPARQL, it's simply providing a (hopefully) simpler overlay on top of SPARQL so that we can retrieve the data we want without having to learn the intricacies of SPARQL, nor how Wikidata models publications and people.
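To make the "overlay" idea concrete, here is a minimal sketch in Python (not the code behind the endpoint) of how a resolver might turn a list of requested fields into a SPARQL query for a work with a given DOI. The field names, the FIELD_TO_PROPERTY mapping and the function name are assumptions made for illustration; the Wikidata property ids (P356 DOI, P1476 title, P577 publication date, P1433 published in) are real, and the wdt: prefix is assumed to be predefined, as it is on the Wikidata query service.

```python
# Hypothetical mapping from GraphQL-style field names to Wikidata properties.
FIELD_TO_PROPERTY = {
    "title": "P1476",      # title
    "date": "P577",        # publication date
    "container": "P1433",  # published in
}


def work_query_for_doi(doi, fields):
    """Build a SPARQL query returning the requested fields for one DOI."""
    unknown = [f for f in fields if f not in FIELD_TO_PROPERTY]
    if unknown:
        raise ValueError(f"Unknown field(s): {', '.join(unknown)}")

    # Wikidata stores DOIs in upper case; also escape characters that
    # would break out of a SPARQL string literal.
    literal = doi.upper().replace("\\", "\\\\").replace('"', '\\"')

    lines = [
        "SELECT * WHERE {",
        f'  ?work wdt:P356 "{literal}" .',
    ]
    for field in fields:
        # Each field is optional so a missing value does not hide the work.
        prop = FIELD_TO_PROPERTY[field]
        lines.append(f"  OPTIONAL {{ ?work wdt:{prop} ?{field} . }}")
    lines.append("}")
    return "\n".join(lines)
```

The client only ever sees names like "title" and "date"; the property ids and the SPARQL syntax stay on the server, which is the whole point of the overlay.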

Friday, December 13, 2019

The Semantic Web revisited: thoughts on SWAT4HCLS

This week I attended the SWAT4(HC)LS (Semantic Web Applications and Tools for Healthcare and Life Sciences) meeting in Edinburgh. Although a relatively small meeting, SWAT4(HC)LS attracts some big names in the field and featured keynotes by Denny Vrandečić (founder of Wikidata), Dov Greenbaum, Birgitta König-Ries, and Helen Parkinson.
For me this was a chance to get a sense of the state of the Semantic Web, and also to present a talk on biodiversity knowledge graphs. Given that this is a computer science meeting, you need to get a paper submitted and accepted in order to give a talk, so I hastily wrote up some notes on matching author names in taxonomic and bibliographic databases (there's a version of this on bioRxiv):

Page, R. D. M. (2019). Reconciling author names in taxonomic and publication databases. doi:10.1101/870170

Google the "Semantic Web" and pretty soon you discover that many people think it is dead (see Whatever Happened to the Semantic Web?). But it is still here, maybe partly because there is some ambiguity about just what it is. The 2003 paper "Which semantic web?" by Catherine C. Marshall and Frank M. Shipman (doi:10.1145/900051.900063) sketches three different Semantic Webs:

1. a universal library, to be readily accessed and used by humans in a variety of information use contexts.
2. the backdrop for the work of computational agents completing sophisticated activities on behalf of their human counterparts
3. a method for federating particular knowledge bases and databases to perform

(1) is essentially what Google gives us, the ability to use a web browser to find stuff on the web, augmented by structured markup to help us do that (the "Library of Alexandria"). (2) is the idea of global ontologies, agents, and reasoning (the Knowledge Navigator), and (3) focusses on cross linking data in different databases (the "Federated Knowledge Base").

My own focus is very much in area (3): I want to link disconnected datasets together. Many of the presentations at SWAT4(HC)LS were more in area (2) and focussed on ontologies, especially medical. This is a world of big - not always open - ontologies, and lots of discussions about how to model data. In other words, what many people think of as the Semantic Web.

One of the nice things about the conference was the way people with posters got to give a lightning talk about their poster (I've seen this at VIZBI as well). I think this is a great idea and would love to see this at biodiversity conferences. The posters that I got the most out of were from the researchers at the DBCLS in Japan, such as TogoStanza (visualisations of SPARQL results), SPARQList (Markdown notebook for SPARQL), and Umaka Viewer (visualise classes in a SPARQL endpoint).

For fun I tried Umaka Viewer on my Ozymandias knowledge graph. You can see the results here.
It took about 30 minutes to generate the data for this visualisation, but it was fun to poke around at the internals of a knowledge graph that I had created. I discovered classes I'd forgotten I'd used!

As someone who spends a lot of time messing about with ways to collect, clean, and visualise data, it's no surprise that posters and presentations on tools for doing this are what I found most useful. The thing I find most appealing about the Semantic Web is the notion of having simple APIs that can query knowledge encoded in both web pages and databases (see also work by Franck Michel and colleagues on SPARQL Micro-Services, e.g. SPARQL Micro-Services Demo Page).

Friday, August 10, 2018

Ozymandias: a biodiversity knowledge graph of Australian taxa and taxonomic publications

In the spirit of release early and release often, here is the first workable version of a biodiversity knowledge graph that I've been working on for Australian animals (for some background on knowledge graphs see Towards a biodiversity knowledge graph now in RIO). The core of this knowledge graph is a classification of animals from the Atlas of Living Australia (ALA) combined with data on taxonomic names and publications from the Australian Faunal Directory (AFD). This has been enhanced by adding lots of digital identifiers (such as DOIs) to the publications and, where possible, full text either as PDFs or as page scans from the Biodiversity Heritage Library (BHL) (provided via BioStor). Identifiers enable us to further grow the knowledge graph, for example by adding "cites" and "cited by" links between publications (data from CrossRef), and displaying figures from the Biodiversity Literature Repository (BLR).

The demo is here: https://ozymandias-demo.herokuapp.com/ If you’re looking for starting points, you could try:

Assassin spiders (images from Plazi and citation data from CrossRef) https://ozymandias-demo.herokuapp.com/?uri=https://biodiversity.org.au/afd/publication/64908f75-456b-4da8-a82b-c569b4806c22

Memoirs of Museum Victoria (dynamic query finds record in Wikidata and adds map) https://ozymandias-demo.herokuapp.com/?uri=https://biodiversity.org.au/afd/publication/5c22a8d1-7456-4f8c-9384-1246ecbf15a6

G. R. Allen (we can see from the taxonomic tree of his top 20 taxa that he studies fish - who knew?) https://ozymandias-demo.herokuapp.com/?uri=https://biodiversity.org.au/afd/publication/%23creator/g-r-allen

Paper on mosquito taxonomy with lots of citations, including material in BHL/BioStor https://ozymandias-demo.herokuapp.com/?uri=https://biodiversity.org.au/afd/publication/578d1dec-5816-49ec-8916-3f957fd230f5

Paper on Australian flies with full text in BioStor https://ozymandias-demo.herokuapp.com/?uri=https://biodiversity.org.au/afd/publication/0ffe4f28-b8ac-4132-be34-19eb03fbf685

The focus for now is on taxa, publications, journals, and people. Occurrences and sequences are on the “to do” list. As always there’s lots of data cleaning and cross linking to do, but an obvious next step is to link people’s names to identifiers such as ORCID and Wikidata ids, so that we can trace the activities of taxonomists as they discover and describe Australian biodiversity (the choice of Australia is simply to keep things manageable, and because the amount of data and digitisation they’ve done is pretty extraordinary). I’m also working to a deadline as I'm trying to get this demo wrapped up in the next couple of weeks.

Technical details

TL;DR the knowledge graph is implemented as a triple store where the data has been represented using a small number of vocabularies (mostly schema.org with some terms borrowed from TAXREF-LD and the TDWG LSID vocabularies). All results displayed in the first two panels are the result of SPARQL queries, the content in the rightmost panel comes from calls to external APIs. Search is implemented using Elasticsearch. If you are feeling brave you can query the knowledge graph directly in SPARQL. I’m constantly tweaking things and adding data and identifiers, so things are likely to break. More details and documentation will be going up on the GitHub repository.

Wednesday, May 31, 2017

Querying Wikidata

Over 7 million #SPARQL queries/day in @wikidata #WikiCite 👏🏻 pic.twitter.com/l2I6IcnGJj
— WikiCite (@Wikicite) May 23, 2017

For my own use more than anything else I've started creating a list of Wikidata SPARQL queries here. I personally don't find Wikidata's data model particularly easy to grasp, so one way to learn is to take the example queries on the Wikidata Query site and mess about with them.

For those interested in taxonomic data, Wikidata is quite rich in content. For example, you can find the author of a taxonomic name, or find the taxon names an author is responsible for creating.

It is also fairly straightforward to search for content by identifier, e.g.

SELECT *
WHERE
{
  ?work wdt:P356 "10.2476/ASJAA.62.33" .
}

will find the article with the DOI 10.2476/ASJAA.62.33. One minor gotcha is that Wikidata has all DOIs in UPPERCASE, so you either need to search for the uppercase version of the DOI, or use a filter to convert the case, which is slow.

As I come across interesting or useful queries I'll add them to the list in GitHub.

Saturday, January 14, 2017

Displaying taxonomic classifications from Wikidata using d3js and SPARQL

Following on from the previous posts "The Semantic Web made fun: d3sparql" and "The Biodiversity Heritage Library meets Wikidata via Wikispecies: adding author identifiers to BioStor", I've put together an example query that can be used to extract a taxonomic classification from Wikidata. The query is inspired by the http://biohackathon.org/d3sparql/ example, and uses the Wikidata property P171 ("parent taxon"), which is a subproperty of rdfs:subClassOf (the property used in the d3sparql example which queries the Uniprot taxonomy).

The following SPARQL query generates a list of nodes in the tree representing the classification of Hominini (humans, chimps, and their extinct relatives):

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?root_name ?parent_name ?child_name WHERE
{
 VALUES ?root_name {"Hominini"}
 ?root wdt:P225 ?root_name .
 ?child wdt:P171+ ?root .
 ?child wdt:P171 ?parent .
 ?child wdt:P225 ?child_name .
 ?parent wdt:P225 ?parent_name .
}

Using https://query.wikidata.org/sparql as the endpoint, in http://biohackathon.org/d3sparql/ this generates the following diagram:

There are some obvious issues with this classification, such as genera that lack descendant species (e.g., Cyphanthropus). Indeed, we could imagine developing SPARQL queries to flag up such errors (see A use case for RDF in taxonomy). But the availability and accessibility of Wikidata and its SPARQL interface make it a great playground to explore the utility of SPARQL for exploring taxonomic data.

Wednesday, January 11, 2017

The Biodiversity Heritage Library meets Wikidata via Wikispecies: adding author identifiers to BioStor

I've added an experimental feature to BioStor that uses data from Wikidata and Wikispecies to augment the information BioStor displays on authors. This is a crude first step towards the goal of representing all the data in BioStor as a "knowledge graph" where articles, journals, and authors are all treated as entities, all have identifiers, and we can explore relationships between those entities (e.g., citation, co-authorship, etc.). At the moment this is true of articles, which have BioStor URLs (and in many cases DOIs), and for most journals, which are identified by their ISSN. Using identifiers helps reduce ambiguity, especially if there are multiple ways to represent the same thing (e.g., all the alternative ways to write a journal name can be circumvented by using the journal's ISSN).

However, BioStor doesn't have a way to identify authors beyond simply searching for a name. As a first step to tackling this problem I've added a little widget that displays information about an author based on the name you are searching for. For example, searching for George Albert Boulenger will give you a list of publications where the author name is "George Albert Boulenger", as well as a picture of the author and some identifiers (from sources such as VIAF, ISNI, IPNI, and Wikidata):

For now this widget is independent of the data in BioStor. I don't link an article to its author(s) using identifiers for those authors, nor have I tackled the problem of clustering all the variations in people's names together into one set of names that share the same identifier (see Equivalent author names) nor do I attempt to match names to identifiers (see Reconciling author names using Open Refine and VIAF) other than by an exact text search (for details see below). At this stage I just want to get a sense of what identifiers exist for an author, and what I can learn from those identifiers. I also want to explore the potential of Wikispecies as a source of data on people and publications, and how this relates to Wikidata (for earlier thoughts on using Wikipedia for the same goal see Thoughts on Wikipedia, Wikidata, and the Biodiversity Heritage Library).

Wikispecies

I confess I've never really "got" Wikispecies (e.g., Wikispecies is not a database), it seems to exist in isolation from Wikipedia, which is arguably more informative about many species. But there are a couple of things Wikispecies does very well. Firstly, it is building a rich, crowd-sourced bibliography of papers on the taxonomy of many different species. Readers of iPhylo will recall how many times I've expressed frustration at the nearly evidence-free nature of many online taxonomic databases that simply have lists of names unconnected to the primary literature. Many Wikispecies pages have long lists of papers, making it a potential goldmine. Recently there is a lot of interest in extracting bibliographic data from Wikipedia (see WikiCite). Wikispecies could also be harvested, although a major obstacle any such project faces is the lack of a consistent format for references in Wikispecies.

The other nice thing about Wikispecies is that it has articles on taxonomic authorities, and these often list publications by those authors, and also list external identifiers for those authors, such as the VIAF and ISNI identifiers used in the library world, IPNI and ZooBank identifiers used in taxonomic databases, and ORCID, which is becoming the de facto identifier for academic researchers. This information also ends up in Wikidata.

Using Wikidata to glue things together

Wikidata is an interesting project that, like Wikispecies, I've been in two minds about (see Wikidata, Wikipedia, and #wikisci). However, I've started to make more use of it recently. Inspired by the Wikidata:SPARQL query service/2016 SPARQL Workshop I decided to explore the SPARQL query interface to Wikidata. I was struck by one of the example queries involving Wikispecies, and so after a little bit of messing about came up with a query that takes the name of an author and returns some identifiers from Wikidata, as well as an image of that person if one is available. I restrict the results to people that have an article about them in Wikispecies, because I want to start exploring using those articles to make assertions about authorship. Here is a query to search for "George Albert Boulenger":

SELECT *
WHERE
{
  ?item rdfs:label "George Albert Boulenger"@en .
  ?article schema:about ?item .
  ?article schema:isPartOf <https://species.wikimedia.org/> .
  OPTIONAL {
    ?item wdt:P213 ?isni .
  }
  OPTIONAL {
    ?item wdt:P214 ?viaf .
  }
  OPTIONAL {
    ?item wdt:P18 ?image .
  }
  OPTIONAL {
    ?item wdt:P496 ?orcid .
  }
  OPTIONAL {
    ?item wdt:P586 ?ipni .
  }
  OPTIONAL {
    ?item wdt:P2006 ?zoobank .
  }
}

This query simply asks whether Wikidata has an item on this person, whether that item is linked to Wikispecies, what identifiers Wikidata has, and whether there is an image of the person. You can see the query "live" here:

I've added some code to BioStor to do this query on the fly and display the results, for example for Boulenger. The noted carcinologist Jocelyn Crane currently lacks identifiers. A nice surprise was Bernard Landry: note the ORCID 0000-0002-6005-1067. Interestingly, Bernard Landry's ORCID profile doesn't list any publications, whereas we can see lists of these in BioStor and Wikispecies.
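For anyone wanting to do something similar, here is a rough sketch in Python (BioStor's own code is not shown here, so this is not it) of how the JSON returned for the query above could be boiled down to the handful of values the widget needs. The variable names match the OPTIONAL clauses in the query; the function name and the shape of the summary are assumptions.

```python
# Variable names match the OPTIONAL clauses in the query above.
IDENTIFIER_VARS = ("isni", "viaf", "orcid", "ipni", "zoobank")


def summarise_author(results):
    """Collect identifiers and an image URL from SPARQL JSON results."""
    summary = {"item": None, "image": None, "identifiers": {}}
    for binding in results["results"]["bindings"]:
        if "item" in binding:
            summary["item"] = binding["item"]["value"]
        if "image" in binding and summary["image"] is None:
            summary["image"] = binding["image"]["value"]
        for var in IDENTIFIER_VARS:
            if var in binding:
                # A person can have several values for one identifier type,
                # and each combination comes back as a separate row.
                values = summary["identifiers"].setdefault(var, [])
                if binding[var]["value"] not in values:
                    values.append(binding[var]["value"])
    return summary
```

Because every identifier is OPTIONAL, an author with no identifiers still yields a row with just ?item and ?article, and the summary then has an empty "identifiers" dictionary rather than failing.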

Where next?

There are several obstacles to mapping the names of authors to identifiers. One is simply the lack of identifiers. This seems to be rapidly becoming less of a problem with the efforts of the library community around VIAF, the rise of ORCID for living researchers, and the creation of Wikidata items for every taxonomist in Wikispecies. The next challenge is clustering the different ways of writing the same person's name into sets that represent the same person. As discussed above, there are tools for this. Furthermore, with Wikipedia and Wikispecies we have sources of lists of publications linked to a person and their identifiers, which should simplify the task considerably. What is nice about this is that it relies on a crowd-sourcing effort which is already well-established, namely those people who, in adding articles to Wikispecies and Wikipedia, are creating a curated database of publications linked to authors. In many cases those publications are linked to BHL (the source that BioStor extracts its articles from), so many of the links between publications and people are essentially lying there, just waiting for some skilful harvesting.

Thursday, November 24, 2016

The Semantic Web made fun: d3sparql

Continuing my on-again off-again relationship with the Semantic Web, I stumbled across a cool approach to visualising the results of SPARQL queries. Toshiaki Katayama (@tktym) has put together d3sparql, a set of Javascript scripts that takes SPARQL queries and formats the results graphically using D3.

For example, given the SPARQL endpoint http://togostanza.org/sparql, the following query retrieves the NCBI classification for the tardigrade family Hypsibiidae:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?root_name ?parent_name ?child_name
FROM <http://togogenome.org/graph/uniprot>
WHERE
{
  VALUES ?root_name { "Hypsibiidae" }
  ?root up:scientificName ?root_name .
  ?child rdfs:subClassOf+ ?root .
  ?child rdfs:subClassOf ?parent .
  ?child up:scientificName ?child_name .
  ?parent up:scientificName ?parent_name .
}

By outputting the results as a list of parent-child pairs, it is straightforward to convert the output of this query into a form that D3 accepts, so we can get a tree like this:

The ability to quickly generate trees, charts, and maps from SPARQL queries makes things a lot easier. We can play around a little and explore things. The strength (and challenge) of SPARQL is that it is very open-ended, you can more or less develop queries to do anything. Being able to visualise the results will help guide that exploration.
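The conversion from parent-child pairs to a tree is simple enough to sketch. The following Python is only an illustration of the idea (d3sparql itself does this in Javascript, and this is not a port of its code): it takes standard SPARQL JSON results with the ?parent_name and ?child_name variables used in the query above and nests them into the name/children structure that D3's hierarchy layouts expect. The function names are made up.

```python
def bindings_to_rows(results):
    """Pull (parent, child) names out of SPARQL JSON results."""
    return [
        (b["parent_name"]["value"], b["child_name"]["value"])
        for b in results["results"]["bindings"]
    ]


def rows_to_tree(rows, root_name):
    """Nest (parent, child) pairs into D3's {name, children} form."""
    children = {}
    for parent, child in rows:
        children.setdefault(parent, [])
        if child not in children[parent]:  # the query can repeat a pair
            children[parent].append(child)

    def build(name, seen):
        node = {"name": name}
        # 'seen' holds the ancestors of this node, guarding against cycles.
        kids = [c for c in children.get(name, []) if c not in seen]
        if kids:
            node["children"] = [build(c, seen | {c}) for c in kids]
        return node

    return build(root_name, {root_name})
```

Leaves get no "children" key at all, which is how D3 distinguishes them from internal nodes.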

The code for d3sparql is on GitHub. One "gotcha" is that the cached examples and external Javascript libraries aren't included. I've forked the repository here and added the missing files, so that if you grab that version it works straight out of the box.

Thursday, August 01, 2013

A use case for RDF in taxonomy

Readers of this blog will know that I'm sceptical about the current value of linked data and RDF in biodiversity informatics. But I came across an interesting paper on RDF and biocuration that suggests a good "use case" for RDF in constructing and curating taxonomic databases.

The paper is "Catching inconsistencies with the semantic web: a biocuration case study" (PDF here) by Jerven Bolleman and Sebastien Gehant. The basic idea is that errors in databases (in this case, UniProt) can be flagged by constructing queries in SPARQL that return results if there is a problem (for example if a sequence annotation is contradictory).

In recent posts I've been complaining about errors in the GBIF taxonomy, notably duplicate taxa that are synonyms. One way to tackle this would be to develop a set of SPARQL queries that we could use to flag potential problems. For example, if two names are objective synonyms then only one of them should be a node in the GBIF classification. If both exist then we have a problem. If we know a name is a homonym of an older name, but that name exists in the GBIF classification, then we could flag that as an issue. We could also construct queries that flag possible problems, even if we don't have precise information on synonymy. For example, in this post I noted that several frog species appear twice in the GBIF classification because GBIF has aggregated classifications that put these frogs in different genera. We could catch such cases by constructing a query to check whether the same species name (specific epithet) appeared in different genera within the same family.
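As a rough illustration of that last check, here it is written in Python over plain (family, binomial) pairs rather than as SPARQL against a triple store. The function name and input shape are assumptions, and the check is deliberately naive: it compares epithets as exact strings, so it misses gender variants (e.g., albus versus alba) and will also flag unrelated species that legitimately share an epithet.

```python
from collections import defaultdict


def epithets_in_multiple_genera(species):
    """Return {(family, epithet): [genera]} for epithets used in more than one genus."""
    genera = defaultdict(set)
    for family, binomial in species:
        parts = binomial.split()
        if len(parts) < 2:
            continue  # not a binomial, nothing to compare
        genus, epithet = parts[0], parts[1]
        genera[(family, epithet)].add(genus)
    # Only epithets seen under two or more genera are potential duplicates.
    return {key: sorted(g) for key, g in genera.items() if len(g) > 1}
```

Anything this returns is a candidate for a human to look at, not a confirmed duplicate.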

The advantage of using RDF and SPARQL in this context is that the queries are portable. Assuming everyone uses the same vocabulary (e.g., the TDWG LSID vocabularies) then queries can be constructed by one person (e.g., me) and then used by anyone who has their data in a triple store. We could develop a set of "taxonomy tests" that anyone could apply to their database.

This idea needs some more work, but it would be fun to play with some data and see how many kinds of errors or issues we can catch in this way.

Thursday, October 20, 2011

Reflections on the TDWG RDF "Challenge"

This is a follow up to my previous post TDWG Challenge - what is RDF good for? where I'm being, frankly, a pain in the arse, and asking why we bother with RDF? In many ways I'm not particularly anti-RDF, but it bothers me that there's a big disconnect between the reasons we are going down this route and how we are actually using RDF. In other words, if you like RDF and buy the promise of large-scale data integration while still being decentralised ("the web as database"), then we're doing it wrong.

As an aside, my own perspective is one of data integration. I want to link all this stuff together so I can follow a path through multiple datasets and extract the information I want. In other words, "linked data" (little "l", little "d"). I'm interested in fairly light weight integration, typically through shared identifiers. There is also integration via ontologies, which strikes me as a different, if related, problem, that in many ways is closer to the original vision of the Semantic Web as a giant inference engine. I think the concerns (and experience) of these two communities are somewhat different. I don't particularly care about ontologies, I want key-value pairs and reusable identifiers so I can link stuff together. If, for example, you're working on something like Phenoscape, then I think you have a rather more circumscribed set of data, with potentially complicated interrelationships that you want to make inferences on, in which case ontologies are your friend.

So, I posted a "challenge". It wasn't a challenge so much as a set of RDF to play with. What I'm interested in is seeing how easily we can string this data together to learn stuff. For example, using the RDF I posted earlier here is a table listing the name, conservation status, publication DOI and date, and (where available) image from Wikipedia for frogs with sequences in GenBank.

Species	Status	DOI	Year described
Atelopus nanay	CR	http://dx.doi.org/10.1655/0018-0831(2002)058[0229:TNSOAA]2.0.CO;2	2002
Eleutherodactylus mariposa	CR	http://dx.doi.org/10.2307/1466962	1992
Phrynopus kauneorum	CR	http://dx.doi.org/10.2307/1565993	2002
Eleutherodactylus eunaster	CR	http://dx.doi.org/10.2307/1563010	1973
Eleutherodactylus amadeus	CR	http://dx.doi.org/10.2307/1445557	1987
Eleutherodactylus lamprotes	CR	http://dx.doi.org/10.2307/1563010	1973
Churamiti maridadi	CR	http://dx.doi.org/10.1080/21564574.2002.9635467	2002
Eleutherodactylus thorectes	CR	http://dx.doi.org/10.2307/1445381	1988
Eleutherodactylus apostates	CR	http://dx.doi.org/10.2307/1563010	1973
Leptodactylus silvanimbus	CR	http://dx.doi.org/10.2307/1563691	1980
Eleutherodactylus sciagraphus	CR	http://dx.doi.org/10.2307/1563010	1973
Bufo chavin	CR	http://dx.doi.org/10.1643/0045-8511(2001)001[0216:NSOBAB]2.0.CO;2	2001
Eleutherodactylus fowleri	CR	http://dx.doi.org/10.2307/1563010	1973
Ptychohyla hypomykter	CR	http://dx.doi.org/10.2307/3672060	1993
Hyla suweonensis	DD	http://dx.doi.org/10.2307/1444138	1980
Proceratophrys concavitympanum	DD	http://dx.doi.org/10.2307/1565412	2000
Phrynopus bufoides	DD	http://dx.doi.org/10.1643/CH-04-278R2	2005
Boophis periegetes	DD	http://dx.doi.org/10.1111/j.1096-3642.1995.tb01427.x	1995
Phyllomedusa duellmani	DD	http://dx.doi.org/10.2307/1444649	1982
Boophis liami	DD	http://dx.doi.org/10.1163/156853803322440772	2003
Hyalinobatrachium ignioculus	DD	http://dx.doi.org/10.1670/0022-1511(2003)037[0091:ANSOHA]2.0.CO;2	2003
Proceratophrys cururu	DD	http://dx.doi.org/10.2307/1447712	1998
Amolops bellulus	DD	http://dx.doi.org/10.1643/0045-8511(2000)000[0536:ABANSO]2.0.CO;2	2000
Centrolene bacatum	DD	http://dx.doi.org/10.2307/1564528	1994
Litoria kumae	DD	http://dx.doi.org/10.1071/ZO03008	2004
Phrynopus pesantesi	DD	http://dx.doi.org/10.1643/CH-04-278R2	2005
Gastrotheca galeata	DD	http://dx.doi.org/10.2307/1443617	1978
Paratelmatobius cardosoi	DD	http://dx.doi.org/10.2307/1447976	1999
Rhacophorus catamitus	DD	http://dx.doi.org/10.1655/0733-1347(2002)016[0046:NAPKPF]2.0.CO;2	2002
Huia melasma	DD	http://dx.doi.org/10.1643/CH-04-137R3	2005
Telmatobius vilamensis	DD	http://dx.doi.org/10.1655/0018-0831(2003)059[0253:ANSOTA]2.0.CO;2	2003
Callulina kisiwamsitu	EN	http://dx.doi.org/10.1670/209-03A	2004
Arthroleptis nikeae	EN	http://dx.doi.org/10.1080/21564574.2003.9635486	2003
Eleutherodactylus amplinympha	EN	http://dx.doi.org/10.1139/z94-297	1994
Eleutherodactylus glaphycompus	EN	http://dx.doi.org/10.2307/1563010	1973
Bufo tacanensis	EN	http://dx.doi.org/10.2307/1439700	1952
Phrynopus bracki	EN	http://dx.doi.org/10.2307/1445826	1990
Telmatobius sibiricus	EN	http://dx.doi.org/10.1655/0018-0831(2003)059[0127:ANSOTF]2.0.CO;2	2003
Cochranella mache	EN	http://dx.doi.org/10.1655/03-74	2004
Eleutherodactylus melacara	EN	http://dx.doi.org/10.2307/1466962	1992
Plectrohyla glandulosa	EN	http://dx.doi.org/10.2307/1441046	1964
Aglyptodactylus laticeps	EN	http://dx.doi.org/10.1111/j.1439-0469.1998.tb00775.x	1998
Eleutherodactylus glamyrus	EN	http://dx.doi.org/10.2307/1565664	1997
Gastrotheca trachyceps	EN	http://dx.doi.org/10.2307/1564375	1987
Eleutherodactylus grahami	EN	http://dx.doi.org/10.2307/1563929	1979
Litoria havina	LC	http://dx.doi.org/10.1071/ZO9930225	1993
Crinia riparia	LC	http://dx.doi.org/10.2307/1440794	1965
Litoria longirostris	LC	http://dx.doi.org/10.2307/1443159	1977
Osteocephalus mutabor	LC	http://dx.doi.org/10.1163/156853802320877609	2002
Leptobrachium nigrops	LC	http://dx.doi.org/10.2307/1440966	1963
Pseudis tocantins	LC	http://dx.doi.org/10.1590/S0101-81751998000400011	1998
Mantidactylus argenteus	LC	http://dx.doi.org/10.1111/j.1096-3642.1919.tb02128.x	1919
Aglyptodactylus securifer	LC	http://dx.doi.org/10.1111/j.1439-0469.1998.tb00775.x	1998
Pseudis cardosoi	LC	http://dx.doi.org/10.1163/156853800507264	2000
Uperoleia inundata	LC	http://dx.doi.org/10.1071/AJZS079	1981
Litoria pronimia	LC	http://dx.doi.org/10.1071/ZO9930225	1993
Litoria paraewingi	LC	http://dx.doi.org/10.1071/ZO9760283	1976
Philautus aurifasciatus	LC	http://dx.doi.org/10.1163/156853887X00036	1987
Proceratophrys avelinoi	LC	http://dx.doi.org/10.1163/156853893X00156	1993
Osteocephalus deridens	LC	http://dx.doi.org/10.1163/156853800507525	2000
Gephyromantis boulengeri	LC	http://dx.doi.org/10.1111/j.1096-3642.1919.tb02128.x	1919
Crossodactylus caramaschii	LC	http://dx.doi.org/10.2307/1446907	1995
Rana yavapaiensis	LC	http://dx.doi.org/10.2307/1445338	1984
Boophis lichenoides	LC	http://dx.doi.org/10.1163/156853898X00025	1998
Megistolotis lignarius	LC	http://dx.doi.org/10.1071/ZO9790135	1979
Ansonia endauensis	NE	http://dx.doi.org/10.1655/0018-0831(2006)62[466:ANSOAS]2.0.CO;2	2006
Ansonia kraensis	NE	http://dx.doi.org/10.2108/zsj.22.809	2005
Arthroleptella landdrosia	NT	http://dx.doi.org/10.2307/1565359	2000
Litoria jungguy	NT	http://dx.doi.org/10.1071/ZO02069	2004
Phrynobatrachus phyllophilus	NT	http://dx.doi.org/10.2307/1565925	2002
Philautus ingeri	VU	http://dx.doi.org/10.1163/156853887X00036	1987
Gastrotheca dendronastes	VU	http://dx.doi.org/10.2307/1445088	1983
Hyperolius cystocandicans	VU	http://dx.doi.org/10.2307/1443911	1977
Boophis sambirano	VU	http://dx.doi.org/10.1080/21564574.2005.9635520	2005
Ansonia torrentis	VU	http://dx.doi.org/10.1163/156853883X00021	1983
Telmatobufo australis	VU	http://dx.doi.org/10.2307/1563086	1972
Stefania coxi	VU	http://dx.doi.org/10.1655/0018-0831(2002)058[0327:EDOSAH]2.0.CO;2	2002
Oreolalax multipunctatus	VU	http://dx.doi.org/10.2307/1564828	1993
Eleutherodactylus guantanamera	VU	http://dx.doi.org/10.2307/1466962	1992
Spicospina flammocaerulea	VU	http://dx.doi.org/10.2307/1447757	1997
Cycloramphus acangatan	VU	http://dx.doi.org/10.1655/02-78	2003
Leiopelma pakeka	VU	http://dx.doi.org/10.1080/03014223.1998.9517554	1998
Rana okaloosae	VU	http://dx.doi.org/10.2307/1444847	1985
Phrynobatrachus uzungwensis	VU	http://dx.doi.org/10.1163/156853883X00030	1983

This is a small fraction of the frog species actually in GenBank because I've filtered it down to those that have been linked to Wikipedia (from where we get the conservation status) and which were described in papers with DOIs (from which we get the date of description).

I generated this result using this SPARQL query on a triple store that had the primary data sources (Uniprot, Dbpedia, CrossRef, ION) loaded, together with the all-important "glue" datasets that link ION to CrossRef, and Uniprot to Dbpedia (see previous post for details):


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX uniprot: <http://purl.uniprot.org/core/>
PREFIX tdwg_tn: <http://rs.tdwg.org/ontology/voc/TaxonName#>
PREFIX tdwg_co: <http://rs.tdwg.org/ontology/voc/Common#>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?name ?status ?doi ?date ?thumbnail
WHERE {
  ?ncbi uniprot:scientificName ?name .
  ?ncbi rdfs:seeAlso ?dbpedia .
  ?dbpedia dbpedia-owl:conservationStatus ?status .
  ?ion  tdwg_tn:nameComplete ?name . 
  ?ion tdwg_co:publishedInCitation ?doi .
  ?doi dcterms:date ?date .

  OPTIONAL
  {
   ?dbpedia dbpedia-owl:thumbnail ?thumbnail
  }
} 
ORDER BY ASC(?status)

This table doesn't tell us a great deal, but we could, for example, graph date of description against conservation status (CR=critically endangered, EN=endangered, VU=vulnerable, NT=near threatened, LC=least concern, DD=data deficient, NE=not evaluated):
Chart

In other words, is it the case that more recently described species are more likely to be endangered than taxa we've known about for some time (based on the assumption that we've found all the common species already)? We could imagine extending this query to retrieve sequences for a class of frog (e.g., critically endangered) so we could compute a measure of population genetic variation, etc. We shouldn't take the graph above too seriously because it's based on a small fraction of the data, but you get the idea. As more frog taxonomy goes online (there's a lot of stuff in BHL and BioStor, for example) we could add more dates and build a dataset worth analysing properly.
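The summary behind a graph like this takes only a few lines once the query results are in hand. Here is a minimal Python sketch using six rows copied from the table above; the function name is made up, and the median is just one reasonable choice of summary, not necessarily what the original chart shows.

```python
from collections import defaultdict
from statistics import median

# A handful of rows copied from the table above: (species, status, year).
ROWS = [
    ("Atelopus nanay", "CR", 2002),
    ("Eleutherodactylus mariposa", "CR", 1992),
    ("Phrynopus kauneorum", "CR", 2002),
    ("Litoria havina", "LC", 1993),
    ("Crinia riparia", "LC", 1965),
    ("Litoria longirostris", "LC", 1977),
]


def median_year_by_status(rows):
    """Group years of description by conservation status and take the median."""
    years = defaultdict(list)
    for _species, status, year in rows:
        years[status].append(year)
    return {status: median(values) for status, values in years.items()}
```

With only three rows per status this says nothing about frogs, of course; it is the shape of the analysis that matters.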

It seems to me that these should be fairly simple things to do, yet they are the sort of thing that if we attempt today it's a world of hurt involving scripts, Excel, data cleaning, etc. before we can do the science.

The thing is, without the "glue" files mapping identifiers across different databases even this simple query isn't possible. Obviously we have no say in how many organisations publish RDF, but within the biodiversity informatics community we should make every effort to use external identifiers wherever possible so that we can make these links. This is the core of my complaint. If we are using RDF to foster data integration so we can query across the diverse data sets that speak to biodiversity, then we are doing it wrong.
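Stripped of the RDF, the role of a "glue" file is just a lookup table between two sets of identifiers. The Python sketch below makes that explicit; the identifiers are placeholders invented for the example (only the species name and status come from the table above), and the function name is made up.

```python
def join_via_glue(left, glue, right):
    """Join two datasets whose identifiers are linked only through a glue mapping."""
    joined = []
    for left_id, left_record in left.items():
        right_id = glue.get(left_id)
        # Records with no mapping, or mapped to something absent, drop out.
        if right_id in right:
            joined.append({**left_record, **right[right_id]})
    return joined


# Placeholder identifiers, for illustration only.
sequences = {"taxon:1": {"name": "Atelopus nanay"}}
statuses = {"page:1": {"status": "CR"}}
glue = {"taxon:1": "page:1"}

linked = join_via_glue(sequences, glue, statuses)
unlinked = join_via_glue(sequences, {}, statuses)
```

With the mapping in place the two records merge into one; with an empty mapping nothing links up and the result is an empty list, even though both datasets "contain" the species.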

Update
Here is a nice visualisation of this dataset from @orovellotti (original here), made using ecoRelevé:
