Sunday, June 10, 2007

Making taxonomic literature available online

Based on my recent experience developing an OpenURL service (described here, here, and here), linking it to a reference parser and AJAX tool (see David Shorthouse's description of how he did this), and thoughts on XMP, maybe it's time to try to articulate how these pieces could be put together to make taxonomic literature more accessible.

Details below, but basically I think we could make major progress by:

  1. Creating an OpenURL service that knows about as many articles as possible, and has URLs for freely available digital versions of those articles.

  2. Creating a database of articles that the OpenURL service can use.

  3. Creating tools to populate this database, such as bibliography parsers (http://bioguid.info/references is a simple start).

  4. Assigning GUIDs to references.



Background

Probably the single biggest impediment to basic taxonomic research is lack of access to the literature. Why is it that, if I'm looking at a species in a database or a search engine result, I can't click through to the original description of that species? (See my earlier grumble about this). Large-scale efforts like the Biodiversity Heritage Library (BHL) will help, but the current focus on old (pre-1923) literature will severely limit the utility of this project. Furthermore, BHL doesn't seem to have dealt with the GUID issue, or the findability issue (how do I know whether BHL has a particular paper?).

My own view is that most of this is quite straightforward to deal with using existing technology and standards, such as OpenURL and SICIs. The major limitation is the availability of content, but there is a lot of material out there, if we know where to look.

GUIDs

Publications need GUIDs, globally unique identifiers that we can use to identify papers. There are several kinds of GUID already in use, such as DOIs and Handles. As a general GUID for articles, I've been advocating SICIs.

For example, the Nascimento et al. paper I discussed in an earlier post on frog names has the SICI 0365-4508(2005)63<297>2.0.CO;2-2. This SICI comprises the ISSN of the serial (in this example 0365-4508 is the ISSN for Arquivos do Museu Nacional, Rio de Janeiro), the year of publication (2005), the volume (63), and the starting page (297), plus various other bits of administrivia such as a check character. For most articles this combination of four elements is enough to uniquely define an article.

SICIs can be generated easily, and are free (unlike DOIs). They don't have a resolution mechanism, but one could add support for them to an OpenURL resolver. For more details on SICIs I have some bookmarks on del.icio.us.
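To make the structure concrete, here's a minimal Python sketch that assembles a SICI from the four elements described above. The trailing check character is passed in rather than computed (the full checksum algorithm is defined in ANSI/NISO Z39.56 and omitted here), and the "2.0.CO;2" control segment is simply copied from the example above.

def make_sici(issn, year, volume, start_page, check_char):
    # Item-level SICI: ISSN, chronology (year), enumeration (volume),
    # contribution (starting page), then the control segment and check
    # character. The check character should be computed per ANSI/NISO
    # Z39.56; here it is simply supplied by the caller.
    return "%s(%s)%s<%s>2.0.CO;2-%s" % (issn, year, volume, start_page, check_char)

# Nascimento et al. (2005), Arquivos do Museu Nacional, Rio de Janeiro
print(make_sici("0365-4508", 2005, 63, 297, "2"))
# 0365-4508(2005)63<297>2.0.CO;2-2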

Publisher archiving

A number of scientific societies and museums already have literature online, some of which I've made use of (e.g., the AMNH's Bulletins and Novitates, and the journal Psyche). My OpenURL service knows about some 8,000 articles, based on a few days' work. But my sense is that there is much more out there. All this needs is some simple web crawling and parsing to build up a database that an OpenURL service can use.
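To give a flavour of how simple the service side could be, here's a toy Python sketch: a table keyed on the same four elements a SICI uses, and a lookup that answers an OpenURL-style request. The schema here is invented for illustration; it is not the actual bioGUID database.

import sqlite3

# Toy article store keyed on ISSN, year, volume, and starting page.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE article (issn TEXT, year INTEGER, volume TEXT, spage TEXT, url TEXT)")
db.execute("INSERT INTO article VALUES (?,?,?,?,?)",
           ("0365-4508", 2005, "63", "297",
            "http://www.scribd.com/doc/99567/Nascimento-p-297320"))  # the Scribd copy mentioned below

def resolve(issn, year, volume, spage):
    # Answer an OpenURL-style request (issn, date, volume, spage) with a URL, if we have one.
    row = db.execute("SELECT url FROM article WHERE issn=? AND year=? AND volume=? AND spage=?",
                     (issn, year, volume, spage)).fetchone()
    return row[0] if row else None

print(resolve("0365-4508", 2005, "63", "297"))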

Personal and communal archiving

Another, complementary approach is for people to upload papers in their own collections (for example, papers they have authored or have digitised). There are now sites that make this as easy as uploading photos. For example, Scribd is a Flickr-like site where you can upload, tag, and share documents. As a quick test, I uploaded Nascimento et al., which you can see here: http://www.scribd.com/doc/99567/Nascimento-p-297320. Scribd uses Macromedia FlashPaper to display documents.

Many academic institutions have their own archiving programs, such as EPrints (see, for example, EPrints at my own institution).

The trick here is to link these to an OpenURL service. Perhaps the taxonomic community should think about a service very like Scribd, which would at the same time update the OpenURL service every time an article becomes available.
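The update itself could be as simple as the upload site pinging the resolver with the article's details and its new URL. A rough sketch, in which the endpoint and parameter names are entirely hypothetical:

from urllib.parse import urlencode
from urllib.request import urlopen

def notify_resolver(issn, year, volume, spage, url):
    # Tell an OpenURL resolver that a new copy of an article is available.
    # The endpoint and parameter names here are invented for illustration;
    # a real service would define its own interface.
    params = urlencode({"issn": issn, "date": year, "volume": volume,
                        "spage": spage, "url": url})
    return urlopen("http://example.org/openurl-update?" + params).read()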

Summary

I suspect that none of this is terribly hard; most of the issues have already been solved, and it's just a case of gluing the bits together. I also think it's a case of keeping things simple, and resisting the temptation to build large-scale portals. It's a matter of simple services that can be easily used by lots of people. In this way, I think the imaginative way David Shorthouse made my reference parser and OpenURL service trivial for just about anybody to use is a model for how we will make progress.

Saturday, June 09, 2007

Rethinking LSIDs versus HTTP URI

The TDWG-GUID mailing list for this month has a discussion of whether TDWG should commit to LSIDs as the GUID of choice. Since the first GUID workshop TDWG has pretty much been going down this route, despite a growing chorus of voices (including mine) that LSIDs are not first class citizens of the Web, and don't play well with the Semantic Web.

Leaving aside political considerations (this stuff needs to be implemented as soon as possible; concerns that if TDWG advocates HTTP URIs people will just treat them as URLs and miss the significance of persistence and RDF; worries that biodiversity will be ghettoised if it doesn't conform to what is going on elsewhere), I think there is a way to resolve this that may keep most people happy (or at least, they could live with it). My perspective is driven by trying to separate the needs of primary data providers from those of application developers, and by issues of digital preservation.

I'll try to spell out the argument below, but to cut to the chase, I will argue that:

  1. A GUID system needs to provide a globally unique identifier for an object, and a means of retrieving information about that object.

  2. Any of the current technologies we've discussed (LSIDs, DOIs, Handles) do this (to varying degrees), hence any would do as a GUID.

  3. Most applications that use these GUIDs will use Semantic Web tools, and hence will use HTTP URIs.

  4. These HTTP URIs will be unique to the application; the GUIDs, however, will be shared.

  5. No third party application can serve an HTTP URI that doesn't belong to its domain.

  6. Digital preservation will rely on widely distributed copies of data, and these copies cannot all have the same HTTP URI.


From this I think that both parties to this debate are right, and we will end up using both LSIDs and HTTP URIs, and that's OK. Application developers will use HTTP URIs, but will use clients that can handle the various kinds of GUIDs. Data providers will use the GUID technology that is easiest for them to get up and running (for specimens this is likely to be LSIDs; for literature some providers may use Handles via DSpace, some may use URLs).

Individual objects get GUIDs

If individual objects get GUIDs, then this has implications for HTTP URIs. If the HTTP URI is the GUID, an object can only be served from one place. It may be cached elsewhere, but that cached copy can't have the same HTTP URI. Any database that makes use of the HTTP URI cannot serve that HTTP URI itself; it needs to refer to it in some way. This being the case, whether the GUID is a HTTP URI or not starts to look a lot less important, because there is only one place we can get the original data from -- the original data provider. Any application that builds on this data will need its own identifier if people are going to make use of that application's output.

Connotea as an example

As a concrete example, consider Connotea. This application uses dereferenceable GUIDs such as DOIs and PubMed ids to retrieve publications. DOIs and PubMed ids are not HTTP URIs, and hence aren't first class citizens of the Web. But Connotea serves its own records as HTTP URIs, and URIs with the prefix "rss" return RDF (like this) and hence can be used "as is" by Semantic Web tools such as SPARQL.

If we look at some Connotea RDF, we see that it contains the original DOIs and Pubmed ids.


This means that if two Connotea users bookmark the same paper, we could deduce that the two bookmarks refer to the same paper by comparing the embedded GUIDs. In the same way, we could combine RDF from Connotea and another application (such as bioGUID) that has information on the same paper. Why not use the original GUIDs? Well, for starters there are two of them (info:pmid/17079492 and info:doi/10.1073/pnas.0605858103), so which to use? Secondly, they aren't HTTP URIs, and if they were we'd go straight to CrossRef or NCBI, not Connotea. Lastly, we lose the important information that the bookmarks are different -- they were made by two different people (or agents).
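Stripped of the RDF machinery, the matching logic is just a comparison of the embedded identifiers. In this toy Python sketch the record URIs are placeholders, but the GUIDs are the ones above:

# Two records of the same paper, made by different agents.
connotea_bookmark = {"uri": "http://www.connotea.org/...",  # placeholder URI
                     "guids": {"info:pmid/17079492", "info:doi/10.1073/pnas.0605858103"}}
bioguid_record = {"uri": "http://bioguid.info/...",         # placeholder URI
                  "guids": {"info:doi/10.1073/pnas.0605858103"}}

# The records keep their own HTTP URIs (different agents, different added value),
# but the shared embedded GUID tells us they describe the same paper.
print(bool(connotea_bookmark["guids"] & bioguid_record["guids"]))  # True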

Applications will use HTTP URIs

We want to go to Connotea (and Connotea wants us to go to it) because it gives us additional information, such as the tags added by users. Likewise, bioGUID adds links to sequences referred to in the paper. Web applications that build on GUIDs want to add value, and need to add value partly because the quality of the original data may suck. For example, metadata provided by CrossRef is limited, DiGIR providers manage to mangle even basic things like dates, and in my experience many records provided by DiGIR sources that lack geocoordinates have, in fact, been georeferenced (based on reading papers about those specimens). The metadata associated with Handles is often appallingly bad, and don't get me started on what utter gibberish GenBank has in its specimen voucher fields.

Hence, applications will want to edit much of this data to correct and improve it, and to make that edited version available they will need their own identifiers, i.e. HTTP URIs. This applies to everything from social bookmarking tools like Connotea to massive databases like Freebase.

Digital durability

Digital preservation is also relevant. How do we ensure our digital records are durable? Well, we can't ensure this (see Clay Shirky's talk at LongNow), but one way to make them more durable is massive redundancy -- multiple copies in many places. Indeed, given the limited functionality of the current GBIF portal, I would argue that GBIF's main role at present is to make specimen data more durable. DiGIR providers are not online 24/7, but if their data are in GBIF those data are still available. Of course, GBIF could not use the same GUID as the URI for that data; like Connotea, it would have to store the original GUID in its copy of the record.

In the same way, the taxonomic literature of ants is unlikely to disappear anytime soon, because a single paper can be in multiple places. For example, Engel et al.'s paper on ants in Cretaceous amber is available in at least four places.

Which of those four HTTP URIs should be the GUID for this paper? None of them.

LSIDs and the Semantic Web

LSIDs don't play well with the Semantic Web. My feeling is that we should just accept this and move on. I suspect that most users will not interact directly with LSID servers; they will use applications and portals, and these will serve HTTP URIs, which are ideal for Semantic Web applications. Efforts to make LSIDs compliant by inserting owl:sameAs statements and rewriting rdf:resource attributes using a HTTP proxy seem to me misguided, if for no other reason than that one of the strengths of the LSID protocol (no single point of failure, other than the DNS) is massively compromised: if the HTTP proxy goes down (or if the domain name tdwg.org is sold), links between the LSID metadata records will break.

Having a service such as a HTTP proxy that can resolve LSIDs on the fly and rewrite the metadata to be HTTP-resolvable is fine, but to impose an ugly (and possibly short-term) hack on the data providers strikes me as unwise. The only reason for attempting this is if we think the original LSID record will be used directly by Semantic Web applications. I would argue that in reality such applications may harvest these records, but they will make them available to others as part of a record with a HTTP URI (see the Connotea example above).

Conclusions


I think my concerns about LSIDs (and I was an early advocate of LSIDs, see doi:10.1186/1471-2105-6-48) stem from trying to marry them to the Semantic Web, which seems the obvious technology for constructing applications that query lots of distributed metadata. But I wonder if the mantra of "dereferenceable identifiers" can sometimes get in the way. ISBNs given to books are not, of themselves, dereferenceable, but serve very well as identifiers of books (same ISBN, same book), and there are tools that can retrieve metadata given an ISBN (e.g., LibraryThing).

In a world of multiple GUIDs for the same thing, and multiple applications wanting to talk about the same thing, I think clearly separating identifiers from HTTP URIs is useful. For an application such as Connotea, a data aggregator such as GBIF, a database like Freebase, or a repository like the Internet Archive, HTTP URIs are the obvious choice (if I use a Connotea HTTP URI, I want Connotea's data on a particular paper). For GUID providers, there may be other issues to consider.

Note that I'm not saying that we can't use HTTP URIs as GUIDs. In some, perhaps many, cases they may well be the best option, as they are easy to set up. It's just that I accept that not all GUIDs need be HTTP URIs. Given the arguments above, I think the key thing is to have stable identifiers for which we can retrieve associated metadata. Data providers can focus on providing those; application developers can focus on linking them and their associated metadata together, and repackaging the results for consumption by the cloud.

Thursday, June 07, 2007

Earth not flat - official

Oops. One big problem with drawing trees in Google Earth is that the Earth, sadly, is not flat. This means that widely distributed clades cause problems if I draw straight lines between nodes in the tree. For geographically limited clades (such as the Hawaiian katydids shown earlier) this is not really a problem. But for something like plethodontid salamanders (data from TreeBASE study S1139, see doi:10.1073/pnas.0405785101), it is an issue.

One approach is to draw the tree using straight lines connecting nodes, and elevate the tree sufficiently high above the globe so that the lines don't intersect the globe. This is the approach taken by Bill Piel in his server. However, I want to scale the trees by branch length, and hence draw them as phylograms. The screenshot below should make this clearer. Note the salamander Hydromantes italicus in Europe (misspelt in TreeBASE as Hydromantes italucus, but that's another story).



(You can grab the KML file here).

This means that for each internal node in the tree I need to draw a line parallel to the Earth's surface. Oddly, Google Earth seems not to have an easy way to do this. So, we have to do this ourselves. Essentially this requires computing great circles. I managed to find Ed Williams' Aviation Formulary, which gives the necessary equations. The great circle distance d between two points with coordinates {lat1,lon1} and {lat2,lon2} is given by:
d=acos(sin(lat1)*sin(lat2)+cos(lat1)*cos(lat2)*cos(lon1-lon2))

where the latitudes and longitudes are in radians rather than degrees. What we then need is a set of points along the great circle route between the two points. In the following equations f is the fractional distance along the route, where 0 is the first point and 1 is the second point:
A=sin((1-f)*d)/sin(d)
B=sin(f*d)/sin(d)
x = A*cos(lat1)*cos(lon1) + B*cos(lat2)*cos(lon2)
y = A*cos(lat1)*sin(lon1) + B*cos(lat2)*sin(lon2)
z = A*sin(lat1) + B*sin(lat2)
lat=atan2(z,sqrt(x^2+y^2))
lon=atan2(y,x)

So, to draw the tree, for each internal node point 1 is the latitude and longitude of its left child, point 2 is the latitude and longitude of its right child, and we place the internal node at the midpoint of the great circle route (f = 0.5). This diagram shows the construction (based on the Astronomy Answers Great Circle page).

To draw the line, compute coordinates for f = 0.1, f = 0.2, etc., and join the dots.
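In Python (just to show the arithmetic; any language will do), the interpolation looks something like this:

from math import radians, degrees, sin, cos, acos, atan2, sqrt

def great_circle_points(lat1, lon1, lat2, lon2, steps=10):
    # Return points along the great circle between two (lat, lon) pairs, in
    # degrees, by evaluating the equations above at f = 0, 1/steps, ..., 1.
    # The endpoints must not be antipodal (sin(d) would be zero).
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    d = acos(sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(lon1 - lon2))
    points = []
    for i in range(steps + 1):
        f = i / float(steps)
        A = sin((1 - f) * d) / sin(d)
        B = sin(f * d) / sin(d)
        x = A * cos(lat1) * cos(lon1) + B * cos(lat2) * cos(lon2)
        y = A * cos(lat1) * sin(lon1) + B * cos(lat2) * sin(lon2)
        z = A * sin(lat1) + B * sin(lat2)
        points.append((degrees(atan2(z, sqrt(x * x + y * y))),   # latitude
                       degrees(atan2(y, x))))                    # longitude
    return points

# The midpoint used for an internal node (f = 0.5) is great_circle_points(...)[steps // 2].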

The locality data for the sequences are slightly tortuous to obtain. The original paper (doi:10.1073/pnas.0405785101) has a spreadsheet listing MVZ specimen numbers, which I then looked up using a Perl script to interrogate the MVZ DiGIR provider. I grabbed the NEXUS file from TreeBASE, built a quick neighbour-joining tree, pruned off four taxa that had no geographic coordinates, then drew the tree in KML. Of course, in an ideal world all this would be easy (TreeBASE linked to sequences, which are linked to specimens, which are linked to geography), but for now just being able to make pictures like this is kinda fun.

Tuesday, June 05, 2007

Google Earth phylogenies

Now, for something completely different. I've been playing with Google Earth as a phylogeny viewer, inspired by Bill Piel's efforts, the cool avian flu visualisation Janies et al. published in Systematic Biology (doi:10.1080/10635150701266848), and David Kidd's work.



As an example, I've taken a phylogeny for Banza katydids from Shapiro et al. (doi:10.1016/j.ympev.2006.04.006), and created a KML file. Unlike Bill's trees, I've drawn the tree as a phylogram, because I think biogeography becomes much easier to interpret when we have a time scale (or at least a proxy, such as sequence divergence).



I've converted COI branch lengths to altitude, and elevated the tree off the ground to accommodate the fact that the tips don't all line up (this isn't an ultrametric tree). I then use the extrude style of icon so we can see exactly where each sequence was obtained.
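For anyone curious about the KML itself, each branch is just a <LineString> drawn at absolute altitude, and each tip an extruded <Point> dropping a line to the sampling locality. A stripped-down Python sketch (styles, the scaling of branch lengths to metres, and the surrounding KML document are omitted):

def kml_branch(lon1, lat1, alt1, lon2, lat2, alt2):
    # One branch of the tree, drawn at absolute altitude (altitude encodes branch length).
    # KML coordinates are longitude,latitude,altitude.
    return ("<Placemark><LineString><altitudeMode>absolute</altitudeMode>"
            "<coordinates>%f,%f,%f %f,%f,%f</coordinates></LineString></Placemark>"
            % (lon1, lat1, alt1, lon2, lat2, alt2))

def kml_tip(name, lon, lat, alt):
    # A tip: an extruded point that drops a line from the tree down to the locality.
    return ("<Placemark><name>%s</name><Point><extrude>1</extrude>"
            "<altitudeMode>absolute</altitudeMode>"
            "<coordinates>%f,%f,%f</coordinates></Point></Placemark>"
            % (name, lon, lat, alt))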


Wouldn't it be fun to have a collection of molecular trees for Hawaiian taxa for the same gene, plotted on the same Google Earth map? One could ask all sorts of cool questions about the kinds of biogeographic patterns displayed (note that Banza doesn't show a simple west-east progression), and about the ages of those patterns.

Generating the KML file is fairly straightforward, and if I get time I may add it to my long-neglected TreeView X.