Tuesday, July 24, 2007

I hate computers


The PC hosting linnaeus.zoology.gla.ac.uk and darwin.zoology.gla.ac.uk has died, and this spells the end of my interest in (a) using generic PC hardware and (b) running Linux. The former keeps breaking down; the latter is just harder than it needs to be (much as I like the idea). From now on, it's Macs only. No more geeky knapsacks for me.

Because of this crash a lot of my experimental web sites are offline. I'm slowly putting some back up, driven mainly by what my current interests are, what people have asked for, or things linked to by my various blogs. So far, these pages are back online:

A lot of the other stuff will have to wait.

I hate computers...

Tuesday, July 17, 2007

LSID wars

Well, the LSID discussion has just exploded in the last few weeks. I touched on this in my earlier post Rethinking LSIDs versus HTTP URI, but since then the TDWG discussion has become more vigorous (albeit mainly focussed on technical issues, although I suspect that these are symptoms of a larger problem), while the public-semweb-lifesci@w3.org list for July has mushroomed into a near slugfest of discussion about URIs, LSIDs, OWL, the goals of the Semantic Web, etc. There are also blog posts, such as Benjamin Good's The main problem with LSIDs, Mark Wilkinson's numerous posts on his blog, and Pierre Lindenbaum's commentary.

I have no comment to make on this, I'm merely bookmarking them for when I find the time to wade through all this...

Friday, July 13, 2007

Phyloinformatics Workshop in Edinburgh October 22-24 2007


This October 22-24 there is a phyloinformatics workshop at the e-Science Institute in Edinburgh, Scotland, hosted in conjunction with the Isaac Newton Institute for Mathematical Sciences' Phylogenetics Programme.
As phylogenetics scales up to grapple with the tree of life, new informatics challenges have emerged. Some are essentially algorithmic: the underlying problem of inferring phylogeny is computationally very hard. Large trees not only pose computational problems, but can be hard to visualise and navigate efficiently. Methodological issues abound, such as the most efficient way to mine large databases for phylogenetic analysis, and whether the "tree of life" is the appropriate metaphor given evidence for extensive lateral gene transfer and hybridisation between different branches of the tree.

Phylogenies themselves are intrinsically interesting, but their real utility to biologists comes when they are integrated with other data from genomics, geography, stratigraphy, ecology, and development. This poses informatics challenges, ranging from the more general problem of integrating diverse sources of biological data, to how best to store and query phylogenies. Can we express phylogenetic queries using existing database languages, or is it time for a phylogenetic query language? All these topics can be gathered together under the heading "phyloinformatics".

This workshop brings together researchers with backgrounds in biology, computer science, databasing, and mathematics. The aim is to survey the state of the art, present new results, and explore more closely the connections between these topics. The three-day workshop will consist of 10 talks from invited experts (45 minutes each), plus three group discussion sessions (45 minutes to 1 hour each). A poster session will be held in the middle of the meeting for investigators who wish to present their results, and there will also be time set aside for additional discussion and interaction.

The invited speakers are:

For more details visit the web site.

Sunday, June 10, 2007

Making taxonomic literature available online

Based on my recent experience developing an OpenURL service (described here, here, and here), linking this to a reference parser and AJAX tool (see David Shorthouse's description of how he did this), and thoughts on XMP, maybe it's time to try and articulate how this could be put together to make taxonomic literature more accessible.

Details below, but basically I think we could make major progress by:

  1. Creating an OpenURL service that knows about as many articles as possible, and has URLs for freely available digital versions of those articles (see the sketch after this list).

  2. Creating a database of articles that the OpenURL service can use.

  3. Creating tools to populate this database, such as bibliography parsers (http://bioguid.info/references is a simple start).

  4. Assigning GUIDs to references.
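
As a rough illustration of what items 1 and 2 might involve, here is a minimal resolver sketch. It assumes a hypothetical MySQL table called articles with issn, volume, spage, and url columns; it is not the actual bioGUID code.

#!/usr/bin/perl
# Minimal OpenURL resolver sketch (hypothetical "articles" table, not the
# actual bioGUID implementation): look up an article by ISSN, volume, and
# starting page, and redirect to a freely available copy if we know of one.
use strict;
use warnings;
use CGI;
use DBI;

my $q   = CGI->new;
my $dbh = DBI->connect('dbi:mysql:database=openurl;host=localhost',
    'user', 'password', { RaiseError => 1 });

my ($url) = $dbh->selectrow_array(
    'SELECT url FROM articles WHERE issn = ? AND volume = ? AND spage = ?',
    undef,
    scalar $q->param('issn'),
    scalar $q->param('volume'),
    scalar $q->param('spage'),
);

if ($url) {
    print $q->redirect($url);    # we know a free copy, send the user straight to it
}
else {
    print $q->header(-status => '404 Not Found'),
        "No freely available copy known for this article\n";
}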



Background

Probably the single biggest impediment to basic taxonomic research is lack of access to the literature. Why is it that if I'm looking at a species in a database or a search engine result, I can't click through to the original description of that species? (See my earlier grumble about this.) Large-scale efforts like the Biodiversity Heritage Library (BHL) will help, but the current focus on old (pre-1923) literature will severely limit the utility of this project. Furthermore, BHL doesn't seem to have dealt with the GUID issue, or the findability issue (how do I know that BHL has a particular paper?).

My own view is that most of this stuff is quite straightforward to deal with, using existing technology and standards, such as OpenURL and SICIs. The major limitation is availability of content, but there is a lot of stuff out there, if we know where to look.

GUIDs

Publications need GUIDs, globally unique identifiers that we can use to identify papers. There are several kinds of GUID already being used, such as DOIs and Handles. As a general GUID for articles, I've been advocating SICIs.

For example, the Nascimento et al. paper I discussed in an earlier post on frog names has the SICI 0365-4508(2005)63<297>2.0.CO;2-2. This SICI comprises the ISSN number of the serial (in this example 0365-4508 is the ISSN for Arquivos do Museu Nacional, Rio de Janeiro), the year of publication (2005), volume (63), and the starting page (297), plus various other bits of administrivia such as check digits. For most articles this combination of four elements is enough to uniquely define an article.

SICIs can be generated easily, and are free (unlike DOIs). They don't have a resolution mechanism, but one could add support for them to an OpenURL resolver. For more details on SICIs, I have some bookmarks on del.icio.us.
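
To give a feel for how little is involved, here is a sketch of assembling a SICI from the four elements above. I'm glossing over the standard's finer points, and I'm assuming the Algorithm::CheckDigits Perl module's 'sici' scheme to supply the check character.

#!/usr/bin/perl
# Sketch: build a SICI from ISSN, year, volume, and starting page, then append
# the mod 37 check character. Assumes Algorithm::CheckDigits supports the
# 'sici' scheme; the code structure segment is hard-wired as 2.0.CO.
use strict;
use warnings;
use Algorithm::CheckDigits;

sub make_sici {
    my ($issn, $year, $volume, $spage) = @_;
    my $base = sprintf '%s(%d)%d<%d>2.0.CO;2-', $issn, $year, $volume, $spage;
    return CheckDigits('sici')->complete($base);
}

# Nascimento et al. (2005), Arquivos do Museu Nacional 63, starting page 297
print make_sici('0365-4508', 2005, 63, 297), "\n";
# should yield 0365-4508(2005)63<297>2.0.CO;2-2 (or something very like it)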

Publisher archiving

A number of scientific societies and museums already have literature online, some of which I've made use of (e.g., the AMNH's Bulletins and Novitates, and the journal Psyche). My OpenURL service knows about some 8000 articles, based on a few days work. But my sense is that there is much more out there. All this needs is some simple web crawling and parsing to build up a database that an OpenURL service can use.

Personal and communal archiving

Another, complementary approach is for people to upload papers in their collection (for example, papers they have authored or have digitised). There are now sites that make this as easy as uploading photos. For example, Scribd is a Flickr-like site where you can upload, tag, and share documents. As a quick test, I uploaded Nascimento et al., which you can see here: http://www.scribd.com/doc/99567/Nascimento-p-297320. Scribd uses Macromedia Flashpaper to display documents.

Many academic institutions have their own archiving programs, such as ePRINTS (see for example ePrints at my own institution).

The trick here is to link these to an OpenURL service. Perhaps the taxonomic community should think about a service very like Scribd, which would at the same time update the OpenURL service every time an article becomes available.

Summary

I suspect that none of this is terribly hard; most of the issues have already been solved, it's just a case of gluing the bits together. I also think it's a case of keeping things simple, and resisting the temptation to make large-scale portals, etc. It's a matter of simple services that can be easily used by lots of people. In this way, I think the imaginative way David Shorthouse made my reference parser and OpenURL service trivial for just about anybody to use is a model for how we will make progress.

Saturday, June 09, 2007

Rethinking LSIDs versus HTTP URI

The TDWG-GUID mailing list for this month has a discussion of whether TDWG should commit to LSIDs as the GUID of choice. Since the first GUID workshop TDWG has pretty much been going down this route, despite a growing chorus of voices (including mine) that LSIDs are not first class citizens of the Web, and don't play well with the Semantic Web.

Leaving aside political considerations (this stuff needs to be implemented as soon as possible, concerns that if TDWG advocates HTTP URIs people will just treat them as URLs and miss the significance of persistence and RDF, worries that biodiversity will be ghettoised if it doesn't conform to what is going on elsewhere), I think there is a way to resolve this that may keep most people happy (or at least, they could live with it). My perspective is driven by trying to separate the needs of primary data providers from application developers, and issues of digital preservation.

I'll try and spell out the argument below, but to cut to the chase, I will argue that:

  1. A GUID system needs to provide a globally unique identifier for an object, and a means of retrieving information about that object.

  2. Any of the current technologies we've discussed (LSIDs, DOIs, Handles) do this (to varying degrees), hence any would do as a GUID.

  3. Most applications that use these GUIDs will use Semantic Web tools, and hence will use HTTP URIs.

  4. These HTTP URIs will be unique to the application; the GUIDs, however, will be shared.

  5. No third party application can serve an HTTP URI that doesn't belong to its domain.

  6. Digital preservation will rely on widely distributed copies of data; these cannot have the same HTTP URI.


From this I think that both parties to this debate are right, and we will end up using both LSIDs and HTTP URIs, and that's OK. Application developers will use HTTP URIs, but will use clients that can handle the various kinds of GUIDs. Data providers will use the GUID technology that is easiest for them to get up and running (for specimens this is likely to be LSIDs; for literature some providers may use Handles via DSpace, some may use URLs).

Individual objects get GUIDs

If individual objects get GUIDs, then this has implications for HTTP URIs. If the HTTP URI is the GUID, an object can only be served from one place. It may be cached elsewhere, but that cached copy can't have the same HTTP URI. Any database that makes use of the HTTP URI cannot serve that HTTP URI itself, it needs to refer to it in some way. This being the case, whether the GUID is a HTTP URI or not starts to look a lot less important, because there is only one place we can get the original data from -- the original data provider. Any application that builds on this data will need its own identifier if people are going to make use of that application's output.

Connotea as an example

As a concrete example, consider Connotea. This application uses dereferenceable GUIDs such as DOIs and Pubmed ids to retrieve publications. DOIs and Pubmed ids are not HTTP URIs, and hence aren't first class citizens of the Web. But Connotea serves its own records as HTTP URIs, and URIs with the prefix "rss" return RDF (like this) and hence can be used "as is" by Semantic Web tools such as SPARQL.

If we look at some Connotea RDF, we see that it contains the original DOIs and Pubmed ids.


This means that if two Connotea users bookmark the same paper, we could deduce that they are the same paper by comparing the embedded GUIDs. In the same way, we could combine RDF from Connotea and another application (such as bioGUID) that has information on the same paper. Why not use the original GUIDs? Well, for starters there are two of them (info:pmid/17079492 and info:doi/10.1073/pnas.0605858103), so which to use? Secondly, they aren't HTTP URIs, and if they were we'd go straight to CrossRef or NCBI, not Connotea. Lastly, we lose the important information that the bookmarks are different -- they were made by two different people (or agents).

Applications will use HTTP URIs

We want to go to Connotea (and Connotea wants us to go to it) because it gives us additional information, such as the tags added by users. Likewise, bioGUID adds links to sequences referred to in the paper. Web applications that build on GUIDs want to add value, and need to add value partly because the quality of the original data may suck. For example, metadata provided by CrossRef is limited, DiGIR providers manage to mangle even basic things like dates, and in my experience many records provided by DiGIR sources that lack geocoordinates have, in fact, been georeferenced (based on reading papers about those specimens). The metadata associated with Handles is often appallingly bad, and don't get me started on what utter gibberish GenBank has in its specimen voucher fields.

Hence, applications will want to edit much of this data to correct and improve it, and to make that edited version available they will need their own identifiers, i.e. HTTP URIs. This ranges from social bookmarking tools like Connotea, to massive databases like FreeBase.

Digital durability

Digital preservation is also relevant. How do we ensure our digital records are durable? Well, we can't ensure this (see Clay Shirky's talk at LongNow), but one way to make them more durable is massive redundancy -- multiple copies in many places. Indeed, given the limited functionality of the current GBIF portal, I would argue that GBIF's main role at present is to make specimen data more durable. DiGIR providers are not online 24/7, but if their data are in GBIF those data are still available. Of course, GBIF could not use the same GUID as the URI for that data; like Connotea, it would have to store the original GUID in the GBIF copy of the record.

In the same way, the taxonomic literature of ants is unlikely to disappear anytime soon, because a single paper can be in multiple places. For example, Engel et al.'s paper on ants in Cretaceous Amber is available in at least four places:

Which of the four HTTP URIs you can click on should be the GUID for this paper? -- none of them.

LSIDs and the Semantic Web

LSIDs don't play well with the Semantic Web. My feeling is that we should just accept this and move on. I suspect that most users will not interact directly with LSID servers; they will use applications and portals, and these will serve HTTP URIs, which are ideal for Semantic Web applications. Efforts to make LSIDs compliant by inserting owl:sameAs statements and rewriting rdf:resource attributes using a HTTP proxy seem to me to be misguided, if for no other reason than that one of the strengths of the LSID protocol (no single point of failure, other than the DNS) is massively compromised: if the HTTP proxy goes down (or if the domain name tdwg.org is sold), links between the LSID metadata records will break.

Having a service such as a HTTP proxy that can resolve LSIDs on the fly and rewrite the metadata to become HTTP-resolvable is fine, but to impose an ugly (and possibly short term) hack on the data providers strikes me as unwise. The only reason for attempting this is if we think the original LSID record will be used directly by Semantic web applications. I would argue that in reality, such applications may harvest these records, but they will make them available to others as part of a record with a HTTP URI (see Connotea example).

Conclusions


I think my concerns about LSIDs (and I was an early advocate of LSIDs, see doi:10.1186/1471-2105-6-48) stem from trying to marry them to the Semantic Web, which seems the obvious technology for constructing applications to query lots of distributed metadata. But I wonder if the mantra of "dereferenceable identifiers" can sometimes get in the way. ISBNs given to books are not, of themselves, dereferenceable, but serve very well as identifiers of books (same ISBN, same book), and there are tools that can retrieve metadata given an ISBN (e.g., LibraryThing).

In a world of multiple GUIDs for the same thing, and multiple applications wanting to talk about the same thing, I think clearly separating identifiers from HTTP URIs is useful. For an application such as Connotea, a data aggregator such as GBIF, a database like FreeBase, or a repository like the Internet Archive, HTTP URIs are the obvious choice (if I use a Connotea HTTP URI I want Connotea's data on a particular paper). For GUID providers, there may be other issues to consider.

Note that I'm not saying that we can't use HTTP URIs as GUIDs. In some, perhaps many cases they may well be the best option as they are easy to set up. It's just that I accept that not all GUIDs need be HTTP URIs. Given the arguments above, I think the key thing is to have stable identifiers for which we can retrieve associated metadata. Data providers can focus on providing those, application developers can focus on linking them and their associated metadata together, and repackaging the results for consumption by the cloud.

Thursday, June 07, 2007

Earth not flat - official

Oops. One big problem with drawing trees in Google Earth is that the Earth, sadly, is not flat. This means that widely distributed clades cause problems if I draw straight lines between nodes in the tree. For geographically limited clades (such as the Hawaiian katydids shown earlier) this is not really a problem. But for something like plethodontid salamanders (data from TreeBASE study S1139, see doi:10.1073/pnas.0405785101), this is an issue.

One approach is to draw the tree using straight lines connecting nodes, and elevate the tree sufficiently high above the globe so that the lines don't intersect the globe. This is the approach taken by Bill Piel in his server. However, I want to scale the trees by branch length, and hence draw them as phylograms. The screenshot below should make this clearer. Note the salamander Hydromantes italicus in Europe (misspelt in TreeBASE as Hydromantes italucus, but that's another story).



(You can grab the KML file here).

This means that for each internal node in the tree I need to draw a line parallel to the Earth's surface. Oddly, Google Earth seems not to have an easy way to do this. So, we have to do this ourselves. Essentially this requires computing great circles. I managed to find Ed Williams' Aviation Formulary, which gives the necessary equations. The great circle distance d between two points with coordinates {lat1,lon1} and {lat2,lon2} is given by:
d=acos(sin(lat1)*sin(lat2)+cos(lat1)*cos(lat2)*cos(lon1-lon2))

where the latitudes and longitudes are in radians rather than degrees. What we then need is a set of points along the great circle route between the two points. In the following equations f is the fractional distance along the route, where 0 is the first point and 1 is the second point:
A=sin((1-f)*d)/sin(d)
B=sin(f*d)/sin(d)
x = A*cos(lat1)*cos(lon1) + B*cos(lat2)*cos(lon2)
y = A*cos(lat1)*sin(lon1) + B*cos(lat2)*sin(lon2)
z = A*sin(lat1) + B*sin(lat2)
lat=atan2(z,sqrt(x^2+y^2))
lon=atan2(y,x)

So, to draw a tree, for any internal node point 1 is the latitude and longitude of the left child, point 2 is the latitude and longitude of the right child, and we place the internal node at the midpoint of the great circle route (f = 0.5). This diagram shows the construction (based on the Astronomy Answers Great Circle page).

To draw the line, compute coordinates for f = 0.1, f = 0.2, etc., and join the dots.
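
In Perl this amounts to something like the following (a sketch only, using Math::Trig; latitudes and longitudes are in degrees and converted to radians, and I ignore edge cases such as antipodal points, where sin(d) is zero):

use strict;
use warnings;
use Math::Trig qw(deg2rad rad2deg acos);

# Intermediate point at fraction $f along the great circle between two points
# (formulas as above, from Ed Williams' Aviation Formulary).
sub great_circle_point {
    my ($lat1, $lon1, $lat2, $lon2, $f) = @_;
    ($lat1, $lon1, $lat2, $lon2) = map { deg2rad($_) } ($lat1, $lon1, $lat2, $lon2);

    my $d = acos(sin($lat1) * sin($lat2)
               + cos($lat1) * cos($lat2) * cos($lon1 - $lon2));

    my $A = sin((1 - $f) * $d) / sin($d);
    my $B = sin($f * $d) / sin($d);

    my $x = $A * cos($lat1) * cos($lon1) + $B * cos($lat2) * cos($lon2);
    my $y = $A * cos($lat1) * sin($lon1) + $B * cos($lat2) * sin($lon2);
    my $z = $A * sin($lat1) + $B * sin($lat2);

    return (rad2deg(atan2($z, sqrt($x * $x + $y * $y))),
            rad2deg(atan2($y, $x)));
}

# e.g. eleven points along the route between the two children of a node
# my @points = map { [ great_circle_point($lat1, $lon1, $lat2, $lon2, $_ / 10) ] } 0 .. 10;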

The locality data for the sequences were slightly tortuous to obtain. The original paper (doi:10.1073/pnas.0405785101) has a spreadsheet listing MVZ specimen numbers, which I then looked up using a Perl script to interrogate the MVZ DiGIR provider. I grabbed the NEXUS file from TreeBASE, built a quick neighbour joining tree, pruned off four taxa that had no geographic coordinates, then drew the tree in KML. Of course, in an ideal world all this should be easy (TreeBASE linked to sequences, which are linked to specimens, which are linked to geography), but for now just being able to make pictures like this is kinda fun.

Tuesday, June 05, 2007

Google Earth phylogenies

Now, for something completely different. I've been playing with Google Earth as a phylogeny viewer, inspired by Bill Piel's efforts, the cool avian flu visualisation Janies et al. published in Systematic Biology (doi:10.1080/10635150701266848), and David Kidd's work.



As an example, I've taken a phylogeny for Banza katydids from Shapiro et al. (doi:10.1016/j.ympev.2006.04.006), and created a KML file. Unlike Bill's trees, I've drawn the tree as a phylogram, because I think biogeography becomes much easier to interpret when we have a time scale (or at least a proxy, such as sequence divergence).



I've converted COI branch lengths to altitude, and elevated the tree off the ground to accommodate the fact that the tips don't all line up (this isn't an ultrametric tree). I then use the extrude style of icon so we can see exactly where the sequence was obtained from.


Wouldn't it be fun to have a collection of molecular trees for Hawaiian taxa for the same gene, plotted on the same Google Earth map? One could ask all sorts of cool questions about the kinds of biogeographic patterns displayed (note that Banza doesn't show a simple west-east progression), and the ages of the patterns.

Generating the KML file is fairly straightforward, and if I get time I may add it to my long neglected TreeView X.

Wednesday, May 30, 2007

AMNH, DSpace, and OpenURL

Hate my tribe. Hate them for even asking why nobody uses library standards in the larger world, when “brain-dead inflexibility in practice” is one obvious and compelling reason, and “incomprehensibility” is another.

... $DEITY have mercy, OpenURL is a stupid spec. Great idea, and useful in spite of itself. But astoundingly stupid. Ranganathan preserve us from librarians writing specs! - Caveat Lector


OK, we're on a roll. After adding the Journal of Arachnology and Psyche to my OpenURL resolver, I've now added the American Museum of Natural History's Bulletins and Novitates.

In an act of great generosity, the AMNH has placed its publications on a freely accessible DSpace server. This is a wonderful resource provided by one of the world's premier natural history museums (and one others should follow), and is especially valuable given that post-1999 volumes of the Bulletins and Novitates are also hosted by BioOne (and hence have DOIs), but those versions of the publications are not free.

As blogged earlier on SemAnt, getting metadata from DSpace in an actually usable form is a real pain. I ended up writing a script to pull everything off via the OAI interface, extract metadata from the resulting XML, do a DOI look-up for post-1999 material, then dump this into the MySQL server so my OpenURL service can find it.
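
For what it's worth, the harvesting part boils down to something like this (a sketch, not my actual script, and the OAI base URL is illustrative rather than guaranteed to be the real endpoint):

#!/usr/bin/perl
# Sketch of harvesting Dublin Core records over OAI-PMH: fetch ListRecords,
# save each page of XML to disk, and follow resumptionTokens until done.
use strict;
use warnings;
use LWP::Simple qw(get);

my $base = 'http://digitallibrary.amnh.org/dspace-oai/request';    # assumed endpoint
my $url  = "$base?verb=ListRecords&metadataPrefix=oai_dc";
my $page = 0;

while ($url) {
    my $xml = get($url) or die "Failed to fetch $url";
    open my $out, '>', sprintf('oai-%04d.xml', $page++) or die $!;
    print $out $xml;
    close $out;

    # crude, but adequate for a sketch: grab the resumption token, if any
    my ($token) = $xml =~ m{<resumptionToken[^>]*>([^<]+)</resumptionToken>};
    $url = $token ? "$base?verb=ListRecords&resumptionToken=$token" : undef;
}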


Apart from the tedium of having to find the OAI interface (why oh why do people make this harder than it needs to be?), the metadata served up by the AMNH is, um, a little ropey. They use Dublin Core, which is great, but the AMNH makes a hash of using it. Dublin Core provides quite a rich set of terms for describing a reference, and guidelines on how to use it. The AMNH uses the same tag for different things. Take date, for example:

<dc:date>2005-10-05T22:02:08Z</dc:date>
<dc:date>2005-10-05T22:02:08Z</dc:date>
<dc:date>1946</dc:date>

Now, one of these dates is the date of publication; the others are dates the metadata was uploaded (or so I suspect). So, why not use the appropriate terms? Like, for instance, <dcterms:created>. Why do I have to parse three fields and intuit that the third one is the date of publication? Likewise, why have up to three <dc:title> fields, and why include an abbreviated citation in the title? And why, for the love of God, format that citation differently for different articles!? Why have multiple <dc:description> fields, one of which is the abstract (and for which <dcterms:abstract> is available)? It's just a mess, and it's very annoying (as you can probably tell). I can see why some hate library standards.
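
In practice this means resorting to a heuristic, something along these lines (a sketch of the obvious guess, not a documented rule):

# Of the repeated <dc:date> values, treat the one that is a bare year as the
# date of publication and ignore the upload timestamps. A sketch only.
my @dates = ('2005-10-05T22:02:08Z', '2005-10-05T22:02:08Z', '1946');
my ($published) = grep { /^\d{4}$/ } @dates;    # 1946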

Anyway, after much use of Perl regular expressions, and some last minute finessing with Excel, I think we now have the AMNH journals available through OpenURL.

For a demo, go to David Shorthouse's list of references for spiders, select, say, the letter P, and click on the bioGUID symbol by a paper by Norm Platnick in the American Museum Novitates.

Monday, May 28, 2007

OpenURL and spiders

I'm not sure who said it first, but there's a librarianly spin on the old Perl paradigm I think I heard at code4libcon or Access in the past year: instead of "making simple things simple, and complex things possible," we librarians and those of us librarians who write standards tend, in writing our standards, to "make complex things possible, and make simple things complex."
That approach just won't cut it anymore.
-Dan Chudnov, Rethinking OpenURL


Time to bring some threads together. I've been working on a tool to parse references and find existing identifiers. The tool is at http://bioguid.info/references (for more on my bioGUID project see the blog). Basically, you paste in one or more references, and it tries to figure out what they are, using ParaTools and CrossRef's OpenURL resolver. For example, if you paste in this reference:
Vogel, B. R. 2004. A review of the spider genera Pardosa and Acantholycosa (Araneae, Lycosidae) of the 48 contiguous United States. J. Arachnol. 32: 55-108.

the service tells you that there is a DOI (doi:10.1636/H03-8).



OK, but what if there is no DOI? Every issue of the Journal of Arachnology is online, but only issues from 2000 onwards have DOIs (hosted by my favourite DOI breaker, BioOne). How do I link to the other articles?

One way is using OpenURL. What I've done is add an OpenURL service to bioGUID. If you send it a DOI, it simply redirects you to dx.doi.org to resolve it. But I've started to expand it to handle papers that I know have no DOI. First up is the Journal of Arachnology. I used SiteSucker to pull all the HTML files listing the PDFs from the journal's web site. Then I ran a Perl script that read each HTML file and pulled out the links. They weren't terribly consistent, there are at least five or six different ways the links are written, but they are consistent enough to parse. What is especially nice is that the URLs include information on volume and starting page number, which greatly simplifies my task. So, this gives me a list of over 1000 papers, each with a URL, and for each paper I have the journal, year, volume, and starting page. These four things are enough for me to uniquely identify the article. I then store all this information in a MySQL database, and when a user clicks on the OpenURL link in the list of results from the reference parser, if the journal is the Journal of Arachnology, you go straight to the PDF. Here's one to try.
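
For the curious, the link scraping amounts to something like this. It is a sketch only: the file-name pattern here is invented, since as I said the real links come in several different forms.

#!/usr/bin/perl
# Sketch: walk the HTML index pages saved with SiteSucker and pull out links
# to PDFs, assuming a hypothetical naming scheme like "JoA_v32_p55.pdf" that
# encodes volume and starting page. Print tab-delimited rows ready for MySQL.
use strict;
use warnings;

for my $file (glob 'joa/*.html') {
    open my $fh, '<', $file or die "$file: $!";
    my $html = do { local $/; <$fh> };
    close $fh;

    while ($html =~ m{href="([^"]*JoA_v(\d+)_p(\d+)\.pdf)"}gi) {
        my ($url, $volume, $spage) = ($1, $2, $3);
        print join("\t", 'Journal of Arachnology', $volume, $spage, $url), "\n";
    }
}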




Yeah, but what else can we do with this? Well, for one thing, you can use the bioGUID OpenURL service in Connotea. On the Advanced settings page you can set an OpenURL resolver. By default I use CrossRef, but if you put "http://bioguid.info/openurl.php" as the Resolver URL, you will be able to get full text for the Journal of Arachnology (providing that you've entered sufficient bibliographic details when saving the reference).

But I think the next step is to have a GUID for each paper, and in the absence of a DOI I'm in favour of SICIs (see my del.icio.us bookmarks for some background). For example, the paper above has the SICI 0161-8202(1988)16<47>2.0.CO;2-0. If this were a resolvable identifier, then we would have unique, stable identifiers for Journal of Arachnology papers that resolve to PDFs. Anybody making links between, say, a scientific name and when it was published (e.g., Catalogue of Life) could use the SICI as the GUID for the publication.

I need to play more with SICIs (and when I get the chance I'll write a post about what the different bits of a SICI mean), but for now, before I forget, I'll note that while writing code to generate SICIs for the Journal of Arachnology I found a bug in the Perl module Algorithm-CheckDigits-0.44 that I use to compute the checksum for a SICI (the final character in the SICI). The checksum is based on a sum of the characters in the SICI modulo 37, but the code barfs if the sum is exactly divisible by 37 (i.e., the remainder is zero).

Biodiversity Heritage Library blog - look but don't touch

Added the Biodiversity Heritage Library blog to my links on my blog, then noticed that BHL have disabled comments. So, we can view their progress, but can't leave comments. Sigh, I wonder whether BHL has quite grasped that one of the best uses of a blog is to interact with the people who leave comments, in other words, have a conversation. If nothing else, at least it gives you some idea of whether people are actually reading it.
That said, the stuff BHL are putting up on the Internet Archive looks cool. However, how do we find this stuff, and link to it (identifiers anyone?). Wouldn't the BHL blog be a great place to have this conversation...?

Wednesday, May 23, 2007

iTunes, embedded metadata, and DNA barcoding

Continuing on this theme of embedded metadata, this is one reason why DNA barcoding is so appealing. A DNA barcode is rather like embedded metadata -- once we extract it we can look up the sequence and determine the organism's identity (or, at least, whether we've seen it before). It's very like identifying a CD based on a hash computed from the track lengths. Traditional identification is more complicated, involves more nebulous data (let's see, my frog has two bumps on the head, gee, are those things in that picture of a frog bumps?), much of which is not online.

Tuesday, May 22, 2007

XMP


Following on from the previous post, as Howison and Goodrum note, Adobe provides XMP as a way to store metadata in files, such as PDFs. XMP supports RDF and namespaces, which means widely used bibliographic standards such as Dublin Core and PRISM can be embedded in a PDF, so the article doesn't become separated from its metadata. Adobe provides a developer's kit under a BSD license.

The vision of managing digital papers being as easy as managing digital music is really compelling. Imagine auto populating bibliographic management software simply by adding the PDF! Note also that the Creative Commons supports XMP.

Monday, May 21, 2007

iTunes and citation metadata.

Stumbled across a really nice paper (Why can't I manage academic papers like MP3s?) while reading commentary on Tim O'Reilly's post about FreeBase.



In response to Danny Ayers' post, Tim O'Reilly wrote:
I think you miss my point. I wasn't using centralized vs. decentralized as the point of contrast. I was using pre-specified vs. post-specified.

Now, I know you are much closer to all the sem web discussions than I am, and I probably have mischaracterized them somewhat. But you have to ask why are they so widely mischaracterized? There's some fire to go with that smoke.

In a very different context, on a mailing list, Bill Janssen wrote something very apposite:

"Let me recommend the paper "Why can't I manage academic papers like MP3s?", (yes, I realize accepted standards say I should put that comma inside the quotation marks) by James Howison and Abby Goodrum, at http://www.freelancepropaganda.com/archives/MP3vPDF.pdf. The basic thesis is that our common document formats weren't designed for use with digital repositories, and that metadata standards are often designed for the use of librarians and publishers, who have different metadata concerns than end-users have."

That's the distinction between the Semantic Web and Web 2.0 that I was trying to get at.


Howison and Goodrum make some interesting points, especially about how easy it is to get (or create) metadata for a CD compared to handling academic literature. Charlie Rapple on All My Eye suggests that
While the authors' [Howison and Goodrum] pains have to an extent been resolved since, by online reference management/bookmarking tools such as Connotea or CiteULike (which both launched later that year), and by the increase in XML as a format for online articles (which unites the full text and metadata in one file), their issues with full text availability remain.

I think the pain is still there, especially as Connotea relies on articles having a DOI (or some other URI or identifier). Many articles don't have DOIs. Furthermore, often a paper will have a DOI but that DOI is not printed on the article (either the hard copy or the PDF). This is obviously true for articles that were published before DOIs came into existence, but which now have DOIs; however, it is also the case for some recent articles as well. This means we need to use metadata about the article to try and find a DOI. In contrast, programs like iTunes use databases such as Gracenote CDDB to retrieve metadata for a CD, where the CD's identity is computed based on information on the CD itself (i.e., the track lengths). The identifier is computed from the object at hand.
This is one reason why I like SICIs (Serial Item and Contribution Identifier; see my del.icio.us bookmarks for sici for some background). These can be computed from metadata about an individual article, often using just information printed with the article (although the ISSN number might not be). This, coupled with the collaborative nature of CD databases such as CDDB and freedb (users supply missing metadata), makes them a useful illustration of how we might construct a database of taxonomic literature. Users could contribute metadata about papers, with identifiers computed from the papers themselves.

Saturday, May 19, 2007

EoL in the blogsphere


postgenomic is a great way to keep up with science blogs. For example, searching for encyclopedia of life pulls up all sorts of interesting posts. A sampling:

Island of doubt
There is simply no way around this taxonomic deficit. While the EOL won't by itself answer too many questions, by drawing attention to how much work remains before we begin to get a grip on the ecosystems we are already manipulating beyond recognition, maybe, just maybe, we can re-distribute some of our research resources to that less glamorous pursuit known as inventory control.


SciGuy
PRO: Jonathan Fanton, president of the MacArthur Foundation. This is certainly going to advance the science of identification, and the science behind biodiversity.
CON: Dan Graur, a University of Houston professor of biology. I'm skeptical. Some of this knowledge goes back to the 18th century. It's all very nice, but this is not a scientific endeavor, it's an editorial effort. I'm a scientist, I like new knowledge.


My Biotech Life
That flash phylogenetic tree just blew my mind. And the level bar that hides/shows information depending your knowledge level, also cool!


Pharyngula
I don't mean to sound so negative, since I think it's an eminently laudable goal, but I get very, very suspicious when I see all the initial efforts loaded towards building a pretty front end while the complicated core of the project is kept out of focus. I'd be more impressed with something like NCBI Entrez, which, while not as attractive as the EOL mockups, at least starts with the complicated business of integrating multiple databases. I want to see unlovely functionality first, before they try to entice me with a pretty face.


These are not the only blogs, and as always the comments left by others on these blogs are also fascinating. My sense is there is a "wow" factor based on the publicity, coupled with not inconsiderable skepticism about content.

Friday, May 18, 2007

TBMap paper out

My paper on mapping TreeBASE names to other databases is out as a provisional PDF on the BMC Bioinformatics web site (doi:10.1186/1471-2105-8-158 -- not working yet).

The abstract:
TreeBASE is currently the only available large-scale database of published organismal phylogenies. Its utility is hampered by a lack of taxonomic consistency, both within the database, and with names of organisms in external genomic, specimen, and taxonomic databases. The extent to which the phylogenetic knowledge in TreeBASE becomes integrated with these other sources is limited by this lack of consistency.
Taxonomic names in TreeBASE were mapped onto names in the external taxonomic databases IPNI, ITIS, NCBI, and uBio, and graph G of these mappings was constructed. Additional edges representing taxonomic synonymies were added to G, then all components of G were extracted. These components correspond to "name clusters", and group together names in TreeBASE that are inferred to refer to the same taxon. The mapping to NCBI enables hierarchical queries to be performed, which can improve TreeBASE information retrieval by an order of magnitude.
TBMap database provides a mapping of the bulk of the names in TreeBASE to names in external taxonomic databases, and a clustering of those mappings into sets of names that can be regarded as equivalent. This mapping enables queries and visualisations that cannot otherwise be constructed. A simple query interface to the mapping and names clusters is available at: http://linnaeus.zoology.gla.ac.uk/~rpage/tbmap

The TBMap web site needs some work, it's really only intended to document the mapping. Once I've tweaked and updated the mapping, I hope to use it in my forthcoming all-singing, all-dancing phylogeny database...

Monday, May 14, 2007

More EoL commentary

Lucy Odling-Smee has a short piece on EoL in Nature (doi:10.1038/news070508-7), quoting a certain Page chap as saying
"If it's done well it could be fabulous"

Not the most insightful thing I've ever said. One of the issues Lucy's piece highlights is the long term sustainability of electronic resources like EoL. The whole issue of digital curation is worrying, given the transient nature of many electronic resources.

Friday, May 11, 2007

EoL commentary in Science

Mitch Leslie has written an article on EoL (doi:10.1126/science.316.5826.818). It starts:
Hands up if you've heard this before: An ambitious new project promises to create an online compendium of all 1.8 million or so described species. It can already claim participation by premier institutions, a wad of start-up cash, and huzzahs from biodiversity guru Edward O. Wilson. Although some confess to a wary sense of déjà vu, taxonomists hope that the Encyclopedia of Life (EOL) can provide the long-awaited comprehensive species catalog. Even enthusiasts agree that it faces some tall hurdles, however, such as signing up curators and getting permission to use copyrighted material.

Déjà vu because the defunct All-Species Foundation -- also covered in Science (doi:10.1126/science.294.5543.769) -- had much the same ambitions six years ago. It is easy to be sceptical, but I think it was Rudy Giuliani who said "under promise, over deliver." Wise words.

Thursday, May 10, 2007

David Shorthouse enters the blogsphere


David Shorthouse has entered the blogsphere with his iSpiders blog. As David describes it:
This blog will include bits that have fallen off the wagon as it were while developing The Canadian Arachnologist, The Nearctic Spider Database, The Nearctic Arachnologists' Forum and Spider WebWatch. The latter is a citizen science initiative that accepts observation data on 9 ambassador species in North America. I have a strong interest in federating biological data so there will undoubtedly be posts about nomenclatural management, species concepts, data aggregation techniques and the like.

There is already some interesting commentary on EoL.

ITIS and DOIs


Following on from my earlier grumble about how the Catalogue of Life handles literature, I've spent an afternoon mapping publications in the "itis".publications table in a copy of ITIS to external GUIDs, such as DOIs, Handles, and SICIs in JSTOR. The mapping is not complete by any means, but gives an idea of how many publications have GUIDs. You can view the mapping here. Many of the publications in ITIS are books, which don't have DOIs. A lot of the literature is also old (although this doesn't always mean it won't have a DOI).

Of 4296 records, 324 have DOIs (around 7.5%). Not a lot, but still a reasonable chunk. At least 700 of the ITIS publications are books (based on having an ISBN), so the percentage for articles alone is a little higher.

The point of this exercise (following on from my comments on the design flaw in the Catalogue of Life) is that I think taxonomic databases need to use GUIDs internally to maximise their stability and utility.

Indeed, this is another reason to be disappointed with ZooBank. In addition to a poor way to navigate trees (which prompted me to explore tools such as PygmyBrowse), ZooBank does exactly what ITIS and the Catalogue of Life do when it comes to displaying literature -- it displays a text citation (albeit with an invitation to view that record in Zoological Record, a subscription-based service).

For example, the copepod Nitocrellopsis texana was described in ITIS publication 3072, which I've discovered has the DOI doi:10.1023/A:1003892200897. Given a DOI we have a GUID for the publication, and a direct link to it. In contrast, ZooBank merely gives us:
Nitocrellopsis texana n. sp. from central TX (U.S.A.) and N. ahaggarensis n. sp. from the central Algerian Sahara (Copepoda, Harpacticoida). Hydrobiologia 418 (1-3) 15 January: 82

and a link to Zoological Record. Interestingly, even with the resources of ISI behind it, the Zoological Record result doesn't have the DOI.

This for me is one reason ZooBank was so disappointing, it actually provided little of value.

What next? Well, with the 300 or so references mapped to DOIs, one could link those to the ITIS records for the corresponding taxonomic names, and serve these up through something like iSpecies, for example. These would be links to the literature, in many cases original descriptions, to supplement the other literature found by iSpecies.

Wednesday, May 09, 2007

Catalogue of Life design flaw


A bit more browsing of the Catalogue of Life annual checklist for 2007 reveals a rather annoying feature that, I think, cripples the Catalogue's utility. With each release the checklist grows in size. From their web site:
The Species 2000 & ITIS Catalogue of Life is planned to become a comprehensive catalogue of all known species of organisms on Earth by the year 2011. Rapid progress has been made recently and this, the seventh edition of the Annual Checklist, contains 1,008,965 species.

However, with each release the identifiers for each taxon change. For example, if I were to link to the record for the pea crab Pinnotheres pisum this year (2007), I would link to record 3803555, but last year I would have linked to 872170. Record 872170 no longer exists in the 2007 edition.

So, what would a user who based their taxonomic database on the Catalogue of Life do? All their links would break (not just because the URL interface has changed, but the underlying identifiers have changed as well). It's as if the authors of the catalogue have been oblivious to the discussion on globally unique identifiers (GUIDs) and the need for stable, persistent identifiers.

Anybody building a database that gets updated, and possibly rebuilt, needs to think about how their identifiers will change. If identifiers are simply the primary keys in a table, then they will likely be unstable, unless great care is taken. Alternatively, databases that are essentially aggregations of data available elsewhere could use GUIDs as the primary keys. This means that even if the database is restructured, the keys (and hence the identifiers) don't change. For the user, everything still works.
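
To make that concrete, the difference is simply what you choose as the primary key. A sketch (table and column names are invented, not taken from any real checklist database):

#!/usr/bin/perl
# Sketch: key the taxon table on an externally meaningful GUID rather than an
# auto-increment integer, so rebuilding the database doesn't change the
# identifiers that users link to.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=checklist', 'user', 'password',
    { RaiseError => 1 });

$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS taxon (
    guid        VARCHAR(255) NOT NULL PRIMARY KEY,  -- e.g. an LSID or DOI, not a row number
    name        VARCHAR(255) NOT NULL,
    parent_guid VARCHAR(255)
)
SQL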

Despite the favourable press about its progress (e.g., doi:10.1038/news050314-6, Environmental Research Web, and CNN), I think the catalogue needs some serious rethinking if it is to be genuinely useful. For more on this, see my earlier posting on how the catalogue handles literature.

Image of Pinnotheres pisum by Hans Hillewaert obtained from Wikimedia Commons.