Showing posts with label SPARQL. Show all posts

Monday, December 20, 2021

GraphQL for WikiData (WikiCite)

I've released a very crude GraphQL endpoint for WikiData. More precisely, the endpoint is for a subset of the entities that are of interest to WikiCite, such as scholarly articles, people, and journals. There is a crude demo at https://wikicite-graphql.herokuapp.com. The endpoint itself is at https://wikicite-graphql.herokuapp.com/gql.php. There are various ways to interact with the endpoint; personally I like the Altair GraphQL Client by Samuel Imolorhe.

As I've mentioned earlier, it's taken me a while to see the point of GraphQL. But it is clear it is gaining traction in the biodiversity world (see for example the GBIF Hosted Portals) so it's worth exploring. My take on GraphQL is that it is a way to create a self-describing API that someone developing a web site can use without having to bury themselves in the gory details of how data is internally modelled. For example, WikiData's query interface uses SPARQL, a powerful language that has a steep learning curve (in part because of the administrative overhead brought by RDF namespaces, etc.). In my previous SPARQL-based projects such as Ozymandias and ALEC I have either returned SPARQL results directly (Ozymandias) or formatted SPARQL results as schema.org DataFeeds (equivalent to RSS feeds) (ALEC). Both approaches work, but they are project-specific, and if anyone else tried to build on these projects they might struggle to figure out what was going on. I certainly struggle, and I wrote them!

So it seems worthwhile to explore this approach a little further and see if I can develop a GraphQL interface that can be used to build the sort of rich apps that I want to see. The demo I've created uses SPARQL under the hood to provide responses to the GraphQL queries. So in this sense it's not replacing SPARQL, it's simply providing a (hopefully) simpler overlay on top of SPARQL so that we can retrieve the data we want without having to learn the intricacies of SPARQL, nor how Wikidata models publications and people.
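To make the "overlay" idea concrete, here is a minimal sketch of how a friendly field request might be translated into SPARQL. It is not the code behind the demo endpoint and it uses no GraphQL library: the function name `person_query` and the field names in `FIELD_TO_PROPERTY` are hypothetical, chosen only for illustration. The Wikidata properties themselves (P496 for ORCID, P214 for VIAF, P213 for ISNI, P18 for image) are real.

```python
import re

# Hypothetical mapping from API-style field names to Wikidata properties.
FIELD_TO_PROPERTY = {
    "viaf": "wdt:P214",
    "isni": "wdt:P213",
    "image": "wdt:P18",
}

# An ORCID iD is four groups of four characters; the last may be "X".
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def person_query(orcid, fields):
    """Build a SPARQL query for a person identified by ORCID.

    Only the requested fields are fetched, each in its own OPTIONAL
    block so that a missing value does not hide the person.
    """
    if not ORCID_PATTERN.match(orcid):
        raise ValueError("not an ORCID iD: %r" % orcid)
    unknown = [name for name in fields if name not in FIELD_TO_PROPERTY]
    if unknown:
        raise KeyError("unsupported fields: %s" % ", ".join(unknown))

    lines = ['  ?person wdt:P496 "%s" .' % orcid]
    for name in fields:
        lines.append(
            "  OPTIONAL { ?person %s ?%s . }" % (FIELD_TO_PROPERTY[name], name)
        )
    variables = " ".join("?" + name for name in ["person"] + list(fields))
    return "SELECT %s\nWHERE\n{\n%s\n}" % (variables, "\n".join(lines))


if __name__ == "__main__":
    print(person_query("0000-0002-6005-1067", ["viaf", "image"]))
```

The caller only ever sees names like `viaf` and `image`; the property numbers and the SPARQL syntax stay on the server side, which is the whole appeal of this kind of overlay.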

Friday, December 13, 2019

The Semantic Web revisited: thoughts on SWAT4HCLS


This week I attended the SWAT4(HC)LS (Semantic Web Applications and Tools for Healthcare and Life Sciences) meeting in Edinburgh. Although a relatively small meeting, SWAT4(HC)LS attracts some big names in the field and featured keynotes by Denny Vrandečić (founder of Wikidata), Dov Greenbaum, Birgitta König-Ries, and Helen Parkinson.
For me this was a chance to get a sense of the state of the Semantic Web, and also to present a talk on biodiversity knowledge graphs. Given that this is a computer science meeting, you need to get a paper submitted and accepted in order to give a talk, so I hastily wrote up some notes on matching author names in taxonomic and bibliographic databases (there's a version of this on bioRxiv):
Page, R. D. M. (2019). Reconciling author names in taxonomic and publication databases. doi:10.1101/870170
Google the "Semantic Web" and pretty soon you discover that many people think it is dead (see Whatever Happened to the Semantic Web?). But it is still here, maybe partly because there is some ambiguity about just what it is. The 2003 paper "Which semantic web?" by Catherine C. Marshall and Frank M. Shipman (doi:10.1145/900051.900063) sketches three different Semantic Webs:


  1. a universal library, to be readily accessed and used by humans in a variety of information use contexts.
  2. the backdrop for the work of computational agents completing sophisticated activities on behalf of their human counterparts
  3. a method for federating particular knowledge bases and databases to perform
(1) is essentially what Google gives us, the ability to use a web browser to find stuff on the web, augmented by structured markup to help us do that (the "Library of Alexandria"). (2) is the idea of global ontologies, agents, and reasoning (the Knowledge Navigator), and (3) focusses on cross linking data in different databases (the "Federated Knowledge Base").

My own focus is very much in area (3): I want to link disconnected datasets together. Many of the presentations at SWAT4(HC)LS were more in area (2) and focussed on ontologies, especially medical. This is a world of big - not always open - ontologies, and lots of discussions about how to model data. In other words, what many people think of as the Semantic Web.

One of the nice things about the conference was the way people with posters got to give a lightning talk about their poster (I've seen this at VIZBI as well). I think this is a great idea and would love to see this at biodiversity conferences. The posters that I got the most out of were from the researchers at the DBCLS in Japan, such as TogoStanza (visualisations of SPARQL results), SPARQList (Markdown notebook for SPARQL), and Umaka Viewer (visualise classes in a SPARQL endpoint).

For fun I tried Umaka Viewer on my Ozymandias knowledge graph. You can see the results here.
It took about 30 minutes to generate the data for this visualisation, but it was fun to poke around at the internals of a knowledge graph that I had created. I discovered classes I'd forgotten I'd used!


As someone who spends a lot of time messing about with ways to collect, clean, and visualise data, it's no surprise that posters and presentations on tools for doing this are what I found most useful. The thing I find most appealing about the Semantic Web is the notion of having simple APIs that can query knowledge encoded in both web pages and databases (see also work by Franck Michel and colleagues on SPARQL Micro-Services, e.g. SPARQL Micro-Services Demo Page).

Friday, August 10, 2018

Ozymandias: a biodiversity knowledge graph of Australian taxa and taxonomic publications

In the spirit of release early and release often, here is the first workable version of a biodiversity knowledge graph that I've been working on for Australian animals (for some background on knowledge graphs see Towards a biodiversity knowledge graph now in RIO). The core of this knowledge graph is a classification of animals from the Atlas of Living Australia (ALA) combined with data on taxonomic names and publications from the Australian Faunal Directory (AFD). This has been enhanced by adding lots of digital identifiers (such as DOIs) to the publications and, where possible, full text either as PDFs or as page scans from the Biodiversity Heritage Library (BHL) (provided via BioStor). Identifiers enable us to further grow the knowledge graph, for example by adding "cites" and "cited by" links between publications (data from CrossRef), and displaying figures from the Biodiversity Literature Repository (BLR).

The demo is here: https://ozymandias-demo.herokuapp.com/ If you’re looking for starting points, you could try:

Assassin spiders (images from Plazi and citation data from CrossRef) https://ozymandias-demo.herokuapp.com/?uri=https://biodiversity.org.au/afd/publication/64908f75-456b-4da8-a82b-c569b4806c22


Memoirs of Museum Victoria (dynamic query finds record in Wikidata and adds map) https://ozymandias-demo.herokuapp.com/?uri=https://biodiversity.org.au/afd/publication/5c22a8d1-7456-4f8c-9384-1246ecbf15a6


G. R. Allen (we can see from the taxonomic tree of his top 20 taxa that he studies fish - who knew?) https://ozymandias-demo.herokuapp.com/?uri=https://biodiversity.org.au/afd/publication/%23creator/g-r-allen


Paper on mosquito taxonomy with lots of citations, including material in BHL/BioStor https://ozymandias-demo.herokuapp.com/?uri=https://biodiversity.org.au/afd/publication/578d1dec-5816-49ec-8916-3f957fd230f5


Paper on Australian flies with full text in BioStor https://ozymandias-demo.herokuapp.com/?uri=https://biodiversity.org.au/afd/publication/0ffe4f28-b8ac-4132-be34-19eb03fbf685


The focus for now is on taxa, publications, journals, and people. Occurrences and sequences are on the “to do” list. As always there’s lots of data cleaning and cross linking to do, but an obvious next step is to link people’s names to identifiers such as ORCID and Wikidata ids, so that we can trace the activities of taxonomists as they discover and describe Australian biodiversity (the choice of Australia is simply to keep things manageable, and because the amount of data and digitisation they’ve done is pretty extraordinary). I’m also working to a deadline as I'm trying to get this demo wrapped up in the next couple of weeks.

Technical details

TL;DR the knowledge graph is implemented as a triple store where the data has been represented using a small number of vocabularies (mostly schema.org with some terms borrowed from TAXREF-LD and the TDWG LSID vocabularies). All results displayed in the first two panels are the result of SPARQL queries, the content in the rightmost panel comes from calls to external APIs. Search is implemented using Elasticsearch. If you are feeling brave you can query the knowledge graph directly in SPARQL. I’m constantly tweaking things and adding data and identifiers, so things are likely to break. More details and documentation will be going up on the GitHub repository.

Wednesday, May 31, 2017

Querying Wikidata

For my own use more than anything else I've started creating a list of Wikidata SPARQL queries here. I personally don't find Wikidata's data model particularly easy to grasp, so one way to learn is to take the example queries on the Wikidata Query site and mess about with them.

For those interested in taxonomic data, Wikidata is quite rich in content. For example, you can find the author of a taxonomic name, or find the taxon names an author is responsible for creating.

It is also fairly straightforward to search for content by identifier, e.g.

SELECT *
WHERE
{
  ?work wdt:P356 "10.2476/ASJAA.62.33" .
}
will find the article with the DOI 10.2476/ASJAA.62.33. One minor gotcha is that Wikidata has all DOIs in UPPERCASE, so you either need to search for the uppercase version of the DOI, or use a filter to convert the case, which is slow.
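The two options for handling DOI case can be sketched as a pair of small query builders. This is only an illustration: the function names are made up, and the filtered variant assumes the standard SPARQL 1.1 `LCASE()` function. The `wdt:` prefix is left undeclared because the Wikidata query service predefines it.

```python
def _clean(doi):
    """Trim whitespace and refuse characters that would break the literal."""
    doi = doi.strip()
    if '"' in doi or "\\" in doi:
        raise ValueError("unexpected character in DOI: %r" % doi)
    return doi


def exact_doi_query(doi):
    """Fast path: upper-case the DOI first, then match the literal directly."""
    return 'SELECT *\nWHERE\n{\n  ?work wdt:P356 "%s" .\n}' % _clean(doi).upper()


def filtered_doi_query(doi):
    """Slow path: fetch every DOI and compare case-insensitively in a FILTER."""
    return (
        "SELECT *\nWHERE\n{\n"
        "  ?work wdt:P356 ?doi .\n"
        '  FILTER(LCASE(?doi) = "%s")\n}' % _clean(doi).lower()
    )


if __name__ == "__main__":
    print(exact_doi_query("10.2476/asjaa.62.33"))
    print(filtered_doi_query("10.2476/ASJAA.62.33"))
```

Normalising the case before the query is sent lets the triple store look the literal up directly, whereas the filter has to inspect every DOI statement.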

As I come across interesting or useful queries I'll add them to the list in GitHub.

Saturday, January 14, 2017

Displaying taxonomic classifications from Wikidata using d3js and SPARQL

Following on from previous posts The Semantic Web made fun: d3sparql and The Biodiversity Heritage Library meets Wikidata via Wikispecies: adding author identifiers to BioStor I've put together an example query that can be used to extract a taxonomic classification from Wikidata. The query is inspired by the http://biohackathon.org/d3sparql/ example, and uses the Wikidata property P171 ("parent taxon") which is a subproperty of rdfs:subClassOf (the property used in the d3sparql example which queries the Uniprot taxonomy).

The following SPARQL query generates a list of nodes in the tree representing the classification of Hominini (humans, chimps, and their extinct relatives):

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?root_name ?parent_name ?child_name WHERE
{
 VALUES ?root_name {"Hominini"}
 ?root wdt:P225 ?root_name .
 ?child wdt:P171+ ?root .
 ?child wdt:P171 ?parent .
 ?child wdt:P225 ?child_name .
 ?parent wdt:P225 ?parent_name .
}

Using https://query.wikidata.org/sparql as the endpoint, in http://biohackathon.org/d3sparql/ this generates the following diagram:

[Screenshot: d3sparql tree of the Hominini classification]

There are some obvious issues with this classification, such as genera that lack descendant species (e.g., Cyphanthropus). Indeed, we could imagine developing SPARQL queries to flag up such errors (see A use case for RDF in taxonomy). But the availability and accessibility of Wikidata and its SPARQL interface makes it a great playground to explore the utility of SPARQL for exploring taxonomic data.

Wednesday, January 11, 2017

The Biodiversity Heritage Library meets Wikidata via Wikispecies: adding author identifiers to BioStor

I've added an experimental feature to BioStor that uses data from Wikidata and Wikispecies to augment what information BioStor displays on authors. This is a crude first step towards the goal of representing all the data in BioStor as a "knowledge graph" where articles, journals, and authors are all treated as entities, all have identifiers, and we can explore relationships between those entities (e.g., citation, co-authorship, etc.). At the moment this is true of articles, which have Biostor URLs (and in many cases DOIs), and for most journals which are identified by their ISSN. Using identifiers helps reduce ambiguity, especially if there are multiple ways to represent the same thing (e.g., all the alternative ways to write a journal name can be circumvented by using the journal's ISSN).

However, BioStor doesn't have a way to identify authors beyond simply searching for a name. As a first step to tackling this problem I've added a little widget that displays information about an author based on the name you are searching for. For example, searching for George Albert Boulenger will give you a list of publications where the author name is "George Albert Boulenger", as well as a picture of the author and some identifiers (from sources such as VIAF, ISNI, IPNI, and Wikidata):

[Screenshot: BioStor author widget for George Albert Boulenger]

For now this widget is independent of the data in BioStor. I don't link an article to its author(s) using identifiers for those authors, nor have I tackled the problem of clustering all the variations in people's names together into one set of names that share the same identifier (see Equivalent author names) nor do I attempt to match names to identifiers (see Reconciling author names using Open Refine and VIAF) other than by an exact text search (for details see below). At this stage I just want to get a sense of what identifiers exist for an author, and what I can learn from those identifiers. I also want to explore the potential of Wikispecies as a source of data on people and publications, and how this relates to Wikidata (for earlier thoughts on using Wikipedia for the same goal see Thoughts on Wikipedia, Wikidata, and the Biodiversity Heritage Library).

Wikispecies

I confess I've never really "got" Wikispecies (e.g., Wikispecies is not a database); it seems to exist in isolation from Wikipedia, which is arguably more informative about many species. But there are a couple of things Wikispecies does very well. Firstly, it is building a rich, crowd-sourced bibliography of papers on the taxonomy of many different species. Readers of iPhylo will recall how many times I've expressed frustration at the nearly evidence-free nature of many online taxonomic databases that simply have lists of names unconnected to the primary literature. Many Wikispecies pages have long lists of papers, making it a potential goldmine. Recently there has been a lot of interest in extracting bibliographic data from Wikipedia (see WikiCite). Wikispecies could also be harvested, although a major obstacle any such project faces is the lack of a consistent format for references in Wikispecies.

The other nice thing about Wikispecies is that it has articles on taxonomic authorities, and these often list publications by those authors, and also list external identifiers for those authors, such as the VIAF and ISNI identifiers used in the library world, IPNI and ZooBank identifiers used in taxonomic databases, and ORCID which is becoming the de-facto identifier for academic researchers. This information also ends up in Wikidata.

Using Wikidata to glue things together

Wikidata is an interesting project that, like Wikispecies, I've been in two minds about (see Wikidata, Wikipedia, and #wikisci). However, I've started to make more use of it recently. Inspired by the Wikidata:SPARQL query service/2016 SPARQL Workshop I decided to explore the SPARQL query interface to Wikidata. I was struck by one of the example queries involving Wikispecies, and so after a little bit of messing about came up with a query that takes the name of an author and returns some identifiers from Wikidata, as well as an image of that person if one is available. I restrict the results to people that have an article about them in Wikispecies, because I want to start exploring using those articles to make assertions about authorship. Here is a query to search for "George Albert Boulenger":

SELECT *
WHERE
{
  ?item rdfs:label "George Albert Boulenger"@en .
  ?article schema:about ?item .
  ?article schema:isPartOf <https://species.wikimedia.org/> .
  OPTIONAL {
    ?item wdt:P213 ?isni .
  }
  OPTIONAL {
    ?item wdt:P214 ?viaf .
  }
  OPTIONAL {
    ?item wdt:P18 ?image .
  }
  OPTIONAL {
    ?item wdt:P496 ?orcid .
  }
  OPTIONAL {
    ?item wdt:P586 ?ipni .
  }
  OPTIONAL {
    ?item wdt:P2006 ?zoobank .
  }
}

This query simply asks whether Wikidata has an item on this person, whether that item is linked to Wikispecies, what identifiers Wikidata has, and whether there is an image of the person. You can see the query "live" here:

I've added some code to BioStor to do this query on the fly, and display the results, for example for Boulenger. The result for noted carcinologist Jocelyn Crane shows that she currently lacks identifiers. A nice surprise was Bernard Landry, whose record includes the ORCID 0000-0002-6005-1067. Interestingly, Bernard Landry's ORCID profile doesn't list any publications, whereas we can see lists of these in BioStor and Wikispecies.
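Doing such a query "on the fly" comes down to two steps: build a request URL for the endpoint, then pull values out of the response. The sketch below is not BioStor's code. It assumes the public endpoint https://query.wikidata.org/sparql with a `format=json` parameter, and it assumes the response follows the standard SPARQL JSON results layout (`results` then `bindings`). The helper names are invented, and no network call is made.

```python
import json
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"


def request_url(query):
    """Return a GET URL that asks the endpoint for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def first_row(results_json):
    """Flatten the first result row to a plain {variable: value} dict.

    Variables left unbound by an OPTIONAL block are simply absent.
    An empty result set gives an empty dict.
    """
    bindings = json.loads(results_json)["results"]["bindings"]
    if not bindings:
        return {}
    return {name: cell["value"] for name, cell in bindings[0].items()}


if __name__ == "__main__":
    print(request_url('SELECT * WHERE { ?item wdt:P496 "0000-0002-6005-1067" . }'))
```

Because unbound variables are missing from the row, a display widget can test for a key such as `orcid` or `image` and show only what is there.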

Where next?

There are several obstacles to mapping the names of authors to identifiers. One is simply the lack of identifiers. This seems to be rapidly becoming less of a problem with the efforts of the library community around VIAF, the rise of ORCID for living researchers, and the creation of Wikidata items for every taxonomist in Wikispecies. The next challenge is clustering the different ways of writing the same person's name into sets that represent the same person. As discussed above, there are tools for this. Furthermore, with Wikipedia and Wikispecies we have sources of lists of publications linked to a person and their identifiers, which should simplify the task considerably. What is nice about this is that it relies on a crowd-sourcing effort which is already well-established, namely those people who, in adding articles to Wikispecies and Wikipedia, are creating a curated database of publications linked to authors. In many cases those publications are linked to BHL (the source that BioStor extracts its articles from), so many of the links between publications and people are essentially lying there, just waiting for some skilful harvesting.

Thursday, November 24, 2016

The Semantic Web made fun: d3sparql


Continuing my on-again off-again relationship with the Semantic Web, I stumbled across a cool approach to visualising the results of SPARQL queries. Toshiaki Katayama (@tktym) has put together d3sparql, a set of Javascript scripts that takes SPARQL queries and formats the results graphically using D3.

For example, given the SPARQL endpoint http://togostanza.org/sparql, the following query retrieves the NCBI classification for the tardigrade family Hypsibiidae:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?root_name ?parent_name ?child_name
FROM <http://togogenome.org/graph/uniprot>
WHERE
{
 VALUES ?root_name { "Hypsibiidae" }
 ?root up:scientificName ?root_name .
 ?child rdfs:subClassOf+ ?root .
 ?child rdfs:subClassOf ?parent .
 ?child up:scientificName ?child_name .
 ?parent up:scientificName ?parent_name .
}

By outputting the results as a list of parent-child pairs, it is straightforward to convert the output of this query into a form that D3 accepts, so we can get a tree like this:

[Figure: D3 tree rooted at Hypsibiidae, with the genera Hebesuncus, Diphascon, Acutuncus, Hypsibius, Borealibius, Thulinius, Isohypsibius, Halobiotus, Ramazzottius, Pseudobiotus, Astatumen, Eremobiotus, Doryphoribius, Itaquascon, Mixibius, and Platicrista and their species as descendants]

The ability to quickly generate trees, charts, and maps from SPARQL queries makes things a lot easier. We can play around a little and explore things. The strength (and challenge) of SPARQL is that it is very open-ended: you can more or less develop queries to do anything. Being able to visualise the results will help guide that exploration.
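The step from a list of parent-child pairs to a tree can be sketched in a few lines. This is not d3sparql's own code (which is Javascript); it is a hedged Python illustration that produces the nested name/children shape commonly fed to D3 hierarchy layouts. The function name `build_tree` is made up, and the handful of pairs in the example are taken from the Hypsibiidae result.

```python
def build_tree(root_name, pairs):
    """Nest (parent_name, child_name) pairs under root_name.

    Returns dicts of the form {"name": ..., "children": [...]}.
    Repeated pairs are ignored so that a child is attached only once.
    """
    nodes = {root_name: {"name": root_name, "children": []}}
    seen = set()
    for parent, child in pairs:
        if (parent, child) in seen:
            continue
        seen.add((parent, child))
        # Create either end of the edge the first time it is mentioned.
        parent_node = nodes.setdefault(parent, {"name": parent, "children": []})
        child_node = nodes.setdefault(child, {"name": child, "children": []})
        parent_node["children"].append(child_node)
    return nodes[root_name]


if __name__ == "__main__":
    import json

    pairs = [
        ("Hypsibiidae", "Hebesuncus"),
        ("Hebesuncus", "Hebesuncus conjugens"),
        ("Hebesuncus", "Hebesuncus ryani"),
        ("Hypsibiidae", "Acutuncus"),
        ("Acutuncus", "Acutuncus antarcticus"),
    ]
    print(json.dumps(build_tree("Hypsibiidae", pairs), indent=1))
```

The pairs can arrive in any order, because each node is created the first time its name is seen, whether as a parent or as a child.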

The code for d3sparql is on GitHub. One "gotcha" is that the cached examples and external Javascript libraries aren't included. I've forked the repository here and added the missing files, so that if you grab that version it works straight out of the box.

Thursday, August 01, 2013

A use case for RDF in taxonomy

Readers of this blog will know that I'm sceptical about the current value of linked data and RDF in biodiversity informatics. But I came across an interesting paper on RDF and biocuration that suggests a good "use case" for RDF in constructing and curating taxonomic databases.

The paper is "Catching inconsistencies with the semantic web: a biocuration case study" (PDF here) by Jerven Bolleman and Sebastien Gehant. The basic idea is that errors in databases (in this case, UniProt) can be flagged by constructing queries in SPARQL that return results if there is a problem (for example if a sequence annotation is contradictory).

In recent posts I've been complaining about errors in the GBIF taxonomy, notably duplicate taxa that are synonyms. One way to tackle this would be to develop a set of SPARQL queries that we could use to flag potential problems. For example, if two names are objective synonyms then only one of them should be a node in the GBIF classification. If both exist then we have a problem. If we know a name is a homonym of an older name, but that name exists in the GBIF classification, then we could flag that as an issue. We could also construct queries that flag possible problems, even if we don't have precise information on synonymy. For example, in this post I noted that several frog species appear twice in the GBIF classification because GBIF has aggregated classifications that put these frogs in different genera. We could catch such cases by constructing a query to check whether the same species name (specific epithet) appeared in different genera within the same family.
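The last check described above (the same specific epithet turning up in more than one genus of a family) is easy to prototype before committing to a vocabulary. The sketch below is plain Python over in-memory records rather than SPARQL, and the function name and the placeholder names ("Examplidae", "Aus bus", and so on) are invented purely for illustration.

```python
from collections import defaultdict


def epithets_in_multiple_genera(species):
    """Flag epithets that occur in more than one genus of the same family.

    species: iterable of (family, genus, epithet) tuples.
    Returns {(family, epithet): [genus, ...]} for the suspicious cases,
    with the genera sorted alphabetically.
    """
    genera = defaultdict(set)
    for family, genus, epithet in species:
        genera[(family, epithet)].add(genus)
    return {key: sorted(found) for key, found in genera.items() if len(found) > 1}


if __name__ == "__main__":
    # Placeholder names only, not real taxa.
    records = [
        ("Examplidae", "Aus", "bus"),
        ("Examplidae", "Cus", "bus"),
        ("Examplidae", "Aus", "dus"),
    ]
    print(epithets_in_multiple_genera(records))
```

A hit is only a candidate for review: different species can legitimately share an epithet, and an exact string match will miss epithets whose ending changes with the gender of the genus.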

The advantage of using RDF and SPARQL in this context is that the queries are portable. Assuming everyone uses the same vocabulary (e.g., the TDWG LSID vocabularies) then queries can be constructed by one person (e.g., me) and then used by anyone who has their data in a triple store. We could develop a set of "taxonomy tests" that anyone could apply to their database.

This idea needs some more work, but it would be fun to play with some data and see how many kinds of errors or issues we can catch in this way.

Thursday, October 20, 2011

Reflections on the TDWG RDF "Challenge"

This is a follow up to my previous post TDWG Challenge - what is RDF good for? where I'm being, frankly, a pain in the arse, and asking why we bother with RDF? In many ways I'm not particularly anti-RDF, but it bothers me that there's a big disconnect between the reasons we are going down this route and how we are actually using RDF. In other words, if you like RDF and buy the promise of large-scale data integration while still being decentralised ("the web as database"), then we're doing it wrong.

As an aside, my own perspective is one of data integration. I want to link all this stuff together so I can follow a path through multiple datasets and extract the information I want. In other words, "linked data" (little "l", little "d"). I'm interested in fairly light weight integration, typically through shared identifiers. There is also integration via ontologies, which strikes me as a different, if related, problem, that in many ways is closer to the original vision of the Semantic Web as a giant inference engine. I think the concerns (and experience) of these two communities are somewhat different. I don't particularly care about ontologies, I want key-value pairs and reusable identifiers so I can link stuff together. If, for example, you're working on something like Phenoscape, then I think you have a rather more circumscribed set of data, with potentially complicated interrelationships that you want to make inferences on, in which case ontologies are your friend.

So, I posted a "challenge". It wasn't a challenge so much as a set of RDF to play with. What I'm interested in is seeing how easily we can string this data together to learn stuff. For example, using the RDF I posted earlier here is a table listing the name, conservation status, publication DOI and date, and (where available) image from Wikipedia for frogs with sequences in GenBank.

| Species | Status | DOI | Year described | Image |
|---|---|---|---|---|
| Atelopus nanay | CR | http://dx.doi.org/10.1655/0018-0831(2002)058[0229:TNSOAA]2.0.CO;2 | 2002 | |
| Eleutherodactylus mariposa | CR | http://dx.doi.org/10.2307/1466962 | 1992 | |
| Phrynopus kauneorum | CR | http://dx.doi.org/10.2307/1565993 | 2002 | |
| Eleutherodactylus eunaster | CR | http://dx.doi.org/10.2307/1563010 | 1973 | |
| Eleutherodactylus amadeus | CR | http://dx.doi.org/10.2307/1445557 | 1987 | |
| Eleutherodactylus lamprotes | CR | http://dx.doi.org/10.2307/1563010 | 1973 | |
| Churamiti maridadi | CR | http://dx.doi.org/10.1080/21564574.2002.9635467 | 2002 | |
| Eleutherodactylus thorectes | CR | http://dx.doi.org/10.2307/1445381 | 1988 | |
| Eleutherodactylus apostates | CR | http://dx.doi.org/10.2307/1563010 | 1973 | |
| Leptodactylus silvanimbus | CR | http://dx.doi.org/10.2307/1563691 | 1980 | |
| Eleutherodactylus sciagraphus | CR | http://dx.doi.org/10.2307/1563010 | 1973 | |
| Bufo chavin | CR | http://dx.doi.org/10.1643/0045-8511(2001)001[0216:NSOBAB]2.0.CO;2 | 2001 | |
| Eleutherodactylus fowleri | CR | http://dx.doi.org/10.2307/1563010 | 1973 | |
| Ptychohyla hypomykter | CR | http://dx.doi.org/10.2307/3672060 | 1993 | |
| Hyla suweonensis | DD | http://dx.doi.org/10.2307/1444138 | 1980 | |
| Proceratophrys concavitympanum | DD | http://dx.doi.org/10.2307/1565412 | 2000 | |
| Phrynopus bufoides | DD | http://dx.doi.org/10.1643/CH-04-278R2 | 2005 | |
| Boophis periegetes | DD | http://dx.doi.org/10.1111/j.1096-3642.1995.tb01427.x | 1995 | |
| Phyllomedusa duellmani | DD | http://dx.doi.org/10.2307/1444649 | 1982 | |
| Boophis liami | DD | http://dx.doi.org/10.1163/156853803322440772 | 2003 | |
| Hyalinobatrachium ignioculus | DD | http://dx.doi.org/10.1670/0022-1511(2003)037[0091:ANSOHA]2.0.CO;2 | 2003 | |
| Proceratophrys cururu | DD | http://dx.doi.org/10.2307/1447712 | 1998 | |
| Amolops bellulus | DD | http://dx.doi.org/10.1643/0045-8511(2000)000[0536:ABANSO]2.0.CO;2 | 2000 | |
| Centrolene bacatum | DD | http://dx.doi.org/10.2307/1564528 | 1994 | |
| Litoria kumae | DD | http://dx.doi.org/10.1071/ZO03008 | 2004 | |
| Phrynopus pesantesi | DD | http://dx.doi.org/10.1643/CH-04-278R2 | 2005 | |
| Gastrotheca galeata | DD | http://dx.doi.org/10.2307/1443617 | 1978 | |
| Paratelmatobius cardosoi | DD | http://dx.doi.org/10.2307/1447976 | 1999 | |
| Rhacophorus catamitus | DD | http://dx.doi.org/10.1655/0733-1347(2002)016[0046:NAPKPF]2.0.CO;2 | 2002 | |
| Huia melasma | DD | http://dx.doi.org/10.1643/CH-04-137R3 | 2005 | |
| Telmatobius vilamensis | DD | http://dx.doi.org/10.1655/0018-0831(2003)059[0253:ANSOTA]2.0.CO;2 | 2003 | |
| Callulina kisiwamsitu | EN | http://dx.doi.org/10.1670/209-03A | 2004 | |
| Arthroleptis nikeae | EN | http://dx.doi.org/10.1080/21564574.2003.9635486 | 2003 | |
| Eleutherodactylus amplinympha | EN | http://dx.doi.org/10.1139/z94-297 | 1994 | |
| Eleutherodactylus glaphycompus | EN | http://dx.doi.org/10.2307/1563010 | 1973 | |
| Bufo tacanensis | EN | http://dx.doi.org/10.2307/1439700 | 1952 | |
| Phrynopus bracki | EN | http://dx.doi.org/10.2307/1445826 | 1990 | |
| Telmatobius sibiricus | EN | http://dx.doi.org/10.1655/0018-0831(2003)059[0127:ANSOTF]2.0.CO;2 | 2003 | |
| Cochranella mache | EN | http://dx.doi.org/10.1655/03-74 | 2004 | |
| Eleutherodactylus melacara | EN | http://dx.doi.org/10.2307/1466962 | 1992 | |
| Plectrohyla glandulosa | EN | http://dx.doi.org/10.2307/1441046 | 1964 | |
| Aglyptodactylus laticeps | EN | http://dx.doi.org/10.1111/j.1439-0469.1998.tb00775.x | 1998 | |
| Eleutherodactylus glamyrus | EN | http://dx.doi.org/10.2307/1565664 | 1997 | |
| Gastrotheca trachyceps | EN | http://dx.doi.org/10.2307/1564375 | 1987 | |
| Eleutherodactylus grahami | EN | http://dx.doi.org/10.2307/1563929 | 1979 | |
| Litoria havina | LC | http://dx.doi.org/10.1071/ZO9930225 | 1993 | |
| Crinia riparia | LC | http://dx.doi.org/10.2307/1440794 | 1965 | |
| Litoria longirostris | LC | http://dx.doi.org/10.2307/1443159 | 1977 | |
| Osteocephalus mutabor | LC | http://dx.doi.org/10.1163/156853802320877609 | 2002 | |
| Leptobrachium nigrops | LC | http://dx.doi.org/10.2307/1440966 | 1963 | |
| Pseudis tocantins | LC | http://dx.doi.org/10.1590/S0101-81751998000400011 | 1998 | |
| Mantidactylus argenteus | LC | http://dx.doi.org/10.1111/j.1096-3642.1919.tb02128.x | 1919 | |
| Aglyptodactylus securifer | LC | http://dx.doi.org/10.1111/j.1439-0469.1998.tb00775.x | 1998 | |
| Pseudis cardosoi | LC | http://dx.doi.org/10.1163/156853800507264 | 2000 | |
| Uperoleia inundata | LC | http://dx.doi.org/10.1071/AJZS079 | 1981 | |
| Litoria pronimia | LC | http://dx.doi.org/10.1071/ZO9930225 | 1993 | |
| Litoria paraewingi | LC | http://dx.doi.org/10.1071/ZO9760283 | 1976 | |
| Philautus aurifasciatus | LC | http://dx.doi.org/10.1163/156853887X00036 | 1987 | |
| Proceratophrys avelinoi | LC | http://dx.doi.org/10.1163/156853893X00156 | 1993 | |
| Osteocephalus deridens | LC | http://dx.doi.org/10.1163/156853800507525 | 2000 | |
| Gephyromantis boulengeri | LC | http://dx.doi.org/10.1111/j.1096-3642.1919.tb02128.x | 1919 | |
| Crossodactylus caramaschii | LC | http://dx.doi.org/10.2307/1446907 | 1995 | |
| Rana yavapaiensis | LC | http://dx.doi.org/10.2307/1445338 | 1984 | |
| Boophis lichenoides | LC | http://dx.doi.org/10.1163/156853898X00025 | 1998 | |
| Megistolotis lignarius | LC | http://dx.doi.org/10.1071/ZO9790135 | 1979 | |
| Ansonia endauensis | NE | http://dx.doi.org/10.1655/0018-0831(2006)62[466:ANSOAS]2.0.CO;2 | 2006 | |
| Ansonia kraensis | NE | http://dx.doi.org/10.2108/zsj.22.809 | 2005 | |
| Arthroleptella landdrosia | NT | http://dx.doi.org/10.2307/1565359 | 2000 | |
| Litoria jungguy | NT | http://dx.doi.org/10.1071/ZO02069 | 2004 | |
| Phrynobatrachus phyllophilus | NT | http://dx.doi.org/10.2307/1565925 | 2002 | |
| Philautus ingeri | VU | http://dx.doi.org/10.1163/156853887X00036 | 1987 | |
| Gastrotheca dendronastes | VU | http://dx.doi.org/10.2307/1445088 | 1983 | |
| Hyperolius cystocandicans | VU | http://dx.doi.org/10.2307/1443911 | 1977 | |
| Boophis sambirano | VU | http://dx.doi.org/10.1080/21564574.2005.9635520 | 2005 | |
| Ansonia torrentis | VU | http://dx.doi.org/10.1163/156853883X00021 | 1983 | |
| Telmatobufo australis | VU | http://dx.doi.org/10.2307/1563086 | 1972 | |
| Stefania coxi | VU | http://dx.doi.org/10.1655/0018-0831(2002)058[0327:EDOSAH]2.0.CO;2 | 2002 | |
| Oreolalax multipunctatus | VU | http://dx.doi.org/10.2307/1564828 | 1993 | |
| Eleutherodactylus guantanamera | VU | http://dx.doi.org/10.2307/1466962 | 1992 | |
| Spicospina flammocaerulea | VU | http://dx.doi.org/10.2307/1447757 | 1997 | |
| Cycloramphus acangatan | VU | http://dx.doi.org/10.1655/02-78 | 2003 | |
| Leiopelma pakeka | VU | http://dx.doi.org/10.1080/03014223.1998.9517554 | 1998 | |
| Rana okaloosae | VU | http://dx.doi.org/10.2307/1444847 | 1985 | |
| Phrynobatrachus uzungwensis | VU | http://dx.doi.org/10.1163/156853883X00030 | 1983 | |


This is a small fraction of the frog species actually in GenBank because I've filtered it down to those that have been linked to Wikipedia (from where we get the conservation status) and which were described in papers with DOIs (from which we get the date of description).

I generated this result using this SPARQL query on a triple store that had the primary data sources (Uniprot, Dbpedia, CrossRef, ION) loaded, together with the all-important "glue" datasets that link ION to CrossRef, and Uniprot to Dbpedia (see previous post for details):


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX uniprot: <http://purl.uniprot.org/core/>
PREFIX tdwg_tn: <http://rs.tdwg.org/ontology/voc/TaxonName#>
PREFIX tdwg_co: <http://rs.tdwg.org/ontology/voc/Common#>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?name ?status ?doi ?date ?thumbnail
WHERE {
?ncbi uniprot:scientificName ?name .
?ncbi rdfs:seeAlso ?dbpedia .
?dbpedia dbpedia-owl:conservationStatus ?status .
?ion tdwg_tn:nameComplete ?name .
?ion tdwg_co:publishedInCitation ?doi .
?doi dcterms:date ?date .

OPTIONAL
{
?dbpedia dbpedia-owl:thumbnail ?thumbnail
}
}
ORDER BY ASC(?status)


This table doesn't tell us a great deal, but we could, for example, graph date of description against conservation status (CR=critically endangered, EN=endangered, VU=vulnerable, NT=near threatened, LC=least concern, DD=data deficient):
[Chart: date of description plotted against conservation status]
In other words, is it the case that more recently described species are more likely to be endangered than taxa we've known about for some time (based on the assumption that we've found all the common species already)? We could imagine extending this query to retrieve sequences for a class of frog (e.g., critically endangered) so we could compute a measure of population genetic variation, etc. We shouldn't take the graph above too seriously because it's based on a small fraction of the data, but you get the idea. As more frog taxonomy goes online (there's a lot of stuff in BHL and BioStor, for example) we could add more dates and build a dataset worth analysing properly.
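As a rough illustration of how the status-versus-date comparison might be prepared once the query results are in hand, the sketch below groups years of description by conservation status and takes the median of each group. The function name is hypothetical, and the six example rows are just the first three CR and the first three LC entries from the table, included only to exercise the function; they are far too few to say anything about frogs.

```python
from collections import defaultdict
from statistics import median


def median_year_by_status(rows):
    """Group (status, year) rows by status and return each group's median year."""
    years = defaultdict(list)
    for status, year in rows:
        years[status].append(year)
    return {status: median(values) for status, values in years.items()}


if __name__ == "__main__":
    rows = [
        ("CR", 2002),  # Atelopus nanay
        ("CR", 1992),  # Eleutherodactylus mariposa
        ("CR", 2002),  # Phrynopus kauneorum
        ("LC", 1993),  # Litoria havina
        ("LC", 1965),  # Crinia riparia
        ("LC", 1977),  # Litoria longirostris
    ]
    print(median_year_by_status(rows))  # {'CR': 2002, 'LC': 1977}
```

The median is used here rather than the mean so that a single very old description (such as the 1919 entries in the table) does not dominate a small group.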

It seems to me that these should be fairly simple things to do, yet they are the sort of thing that if we attempt today it's a world of hurt involving scripts, Excel, data cleaning, etc. before we can do the science.

The thing is, without the "glue" files mapping identifiers across different databases even this simple query isn't possible. Obviously we have no say in how many organisations publish RDF, but within the biodiversity informatics community we should make every effort to use external identifiers wherever possible so that we can make these links. This is the core of my complaint. If we are using RDF to foster data integration so we can query across the diverse data sets that speak to biodiversity, then we are doing it wrong.

Update
Here is a nice visualisation of this dataset from @orovellotti (original here), made using ecoRelevé:

[Image: visualisation of the dataset made with ecoRelevé]