Tuesday, February 26, 2008

Encyclopedia of Life - first impressions

Some thoughts on the first release of the Encyclopedia of Life. I am being deliberately critical. This is a high-profile project with tens of millions of dollars in funding and lots of people involved, and it is accompanied by some of the most overblown hype in organismal biology. In a sense I think EOL has set itself up by over-promising and under-delivering.

Before continuing, I should point out that I am involved in EOL in an advisory capacity, but not in actually making anything. Some of the tools I've blogged about have made their way into EOL, such as Pygmybrowse and reference parsing (see David Shorthouse's excellent work on this).

Lack of content
I think the first release of EOL should have provided at least as much information as I can get from iSpecies and Wikipedia. Other projects, such as Freebase, have pre-populated their databases with content from Wikipedia and other sources. Why didn't EOL? If the argument is that they want authenticated content, then this doesn't wash. Their authenticated content is minimal, and waiting for authentication will, in my view, cripple EOL.

Exemplars are incomplete
The first release contains 25 exemplars. Pages for these taxa
...show the kind of rich environment, with extensive information, to which all the species pages will eventually grow. The information on the exemplar pages has been authenticated (endorsed) by the scientists whose names are listed on these pages.
Well, I hope this isn't the standard EOL aspires to. The pages are incomplete and not interlinked. One of the 25 chosen exemplars is Anolis carolinensis. EOL lists its distribution as:
Widely-distributed throughout the southeastern United States: North Carolina to Key West, Florida, and west to southest Oklahoma and central Texas.

However, the GBIF map EOL displays shows lots of dots in Hawaii:

The EOL account is silent on this interesting distribution pattern. It will come as no surprise that the Wikipedia account of the same species tells us that it has been introduced into Hawaii. Wikipedia 1, EOL 0.


If two pages talk about species that are ecologically associated, then surely those pages should be linked? Among the exemplars is Pissodes strobi, the white pine weevil. In the EOL account, among the hosts listed is Pinus strobus, another exemplar taxon. The accounts of these two taxa are not linked. No hyperlink, nothing. The reader has no idea that there is an exemplar account for Pinus strobus. Furthermore, when reading the account for Pinus strobus there is no indication that it is host to the white pine weevil.
Surely the point of having all this information in one place is so that it can be linked together?

EOL also exposes some limitations of the Biodiversity Heritage Library. Consider the exemplar page for Pinus strobus L. The "L." indicates that this species was described by Linnaeus. Among the many references listed by BHL, none are by Linnaeus. What gives?

Well, the IPNI record reveals that this species was described on p. 1001 of Species Plantarum. BHL has digitised Species Plantarum, and page 1001 has Pinus strobus:

Now, BHL relies on uBio's tools to extract names, and Linnaeus didn't make this easy (the specific epithet strobus is in the right hand margin, separate from Pinus), but one would have thought that for the exemplar taxa an effort would have been made to link Linnaean names to BHL content -- what better place to showcase the link between a name and its publication? It's quite easy to do, given that IPNI has page numbers for plant names. Just map page numbers to BHL URLs, and you're done.
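
A minimal sketch of that mapping, assuming we already have a table from printed page number to a BHL page identifier for each digitised volume. The table contents, the page identifiers, and the URL pattern below are all made up for illustration; they are not real BHL values.

```python
# Hypothetical lookup table: (publication, printed page number) -> BHL page ID.
# The IDs below are placeholders, not real BHL identifiers.
BHL_PAGE_IDS = {
    ("Species Plantarum", 1000): 111000,
    ("Species Plantarum", 1001): 111001,
}

# Assumed URL pattern, for illustration only.
BHL_PAGE_URL = "https://example.org/bhl/page/{page_id}"


def bhl_url_for_name(publication, page):
    """Return a URL for the page on which a name was published, or None."""
    page_id = BHL_PAGE_IDS.get((publication, page))
    if page_id is None:
        return None
    return BHL_PAGE_URL.format(page_id=page_id)


# An IPNI-style record: name, publication, and page of the original description.
record = {"name": "Pinus strobus", "publication": "Species Plantarum", "page": 1001}
print(record["name"], "->", bhl_url_for_name(record["publication"], record["page"]))
```

The hard part in practice is building the page table, not the lookup itself.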

Going down the taxonomic hierarchy, weird things happen. When viewing the plant genus Morus I can see a picture of Morus nigra (presumably this is "authenticated" content). If I drill down to the species Morus nigra, I'm told there is no authenticated content for this species. Either the image is Morus nigra or it isn't. If it is, why not show it? If it isn't, why claim that it is?


Way too much space is devoted to logos of various contributors, BHL being the worst offender (it doesn't help that the BHL content is incomplete, lacking links for Linnaean names). I don't care about logos. Contributors may care about getting their logos displayed, but users couldn't care less. They get in the way. On some pages, there's more screen space devoted to logos than information (e.g., the page for Apomys datae). This is, frankly, ridiculous, and reflects a warped set of priorities.

What's worse, all these logos are associated with links that take people away from EOL. Hence EOL becomes little more than a collection of web links to other sites.

The search is based on the Catalogue of Life, and inherits the same problems. For example, if I search for "Morus" I get a list in alphabetical order of taxonomic names that contain the string "morus". The two names that are an exact match occur as items three and four on the list -- they should be first and second.

It gets worse if I search on "Tyrannosaurus rex". EOL doesn't do dinosaurs, and so doesn't contain anything on T. rex, but the search results tell me that The following 116 search results contain 'Tyrannosaurus rex'. Nope, none of them do.

The search engine is poorly done: it fails to rank results sensibly, incorrectly reports what it does find, and has no support for spelling mistakes.
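
None of this is hard to do better. The following is a rough sketch, not a description of how EOL or the Catalogue of Life actually search: it puts exact matches first, then names that start with the query, then other substring matches, and falls back to close spellings using Python's difflib. The name list is a made-up sample.

```python
import difflib


def rank_matches(query, names):
    """Order substring matches: exact first, then prefix, then the rest."""
    q = query.lower()

    def key(name):
        n = name.lower()
        if n == q:
            tier = 0
        elif n.startswith(q):
            tier = 1
        else:
            tier = 2
        return (tier, n)

    hits = [n for n in names if q in n.lower()]
    return sorted(hits, key=key)


def suggest(query, names, cutoff=0.7):
    """Offer close spellings, for use when nothing contains the query."""
    lowered = {n.lower(): n for n in names}
    close = difflib.get_close_matches(query.lower(), list(lowered), n=3, cutoff=cutoff)
    return [lowered[c] for c in close]


# Made-up sample of names, for illustration only.
names = ["Amorus", "Hypomorus", "Morus", "Morus nigra", "Tyrannus"]
print(rank_matches("Morus", names))
print(rank_matches("Tyrannosaurus rex", names))  # no hits, so report none
print(suggest("Morrus", names))
```

The count shown to the user should simply be the length of the list returned by rank_matches.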

Authenticated content
This is probably the thing that, if left as it is, will strangle EOL. The insistence on "authenticated (endorsed)" content places a severe brake on what EOL can offer.

It's a web site
EOL's web site has no mechanism for people to extract data (e.g., RSS feeds, microformats, links to RDF, etc.). It's intended to be read by humans, not machines. This greatly diminishes its utility.

So, I've got that off my chest. The first release was always going to be a disappointment, especially given the hype. What frustrates me, however, is just how far the first release is from what it could have been.

The real question is how much the issues I've raised are things which are easy to fix given time, or whether they reflect underlying problems with the way the project is conceived.

EOL live

The first release of the Encyclopedia of Life is officially live today. I have promised to be very good...

Monday, February 18, 2008

LSID Tester, a tool for testing Life Science Identifier resolution services

My short note on the LSID Tester tool has been published in the Open Access journal Source Code for Biology and Medicine. The article has just come out, so the DOI (doi:10.1186/1751-0473-3-2) isn't live yet; the direct link is http://www.scfbm.org/content/3/1/2/. Source code for the tester is available from Google Code.

TBMap errors

In the absence of a proper bug reporting system, I'm going to use this post to collect errors in the TBMap project, which maps taxonomic names in TreeBASE onto names in other databases.

T57654, Lycorideae: erroneously agrep matched to the spider family Lycosidae; this is a plant tribe.
T56449, Ficus uncinata: bad agrep match to Pinus uncinata.

Sunday, February 17, 2008

CrossRef blogger tool for DOI lookup

CrossRef have released a tool for bloggers to look up DOIs and insert them into blog posts:
The plug-in, which is available for download at: https://sourceforge.net/projects/crossref-cite/, allows the blogger to use a widget-based interface to search CrossRef metadata using citations or partial citations. The results of the search, with multiple hits, are displayed and the author can then either click on a hit to follow the DOI to the publisher’s site, or click on an icon next to the hit to insert the citation into their blog entry (as either a full citation or as a short “op. cit.”).

So far the tool is only available for WordPress blogs. The idea is that bloggers can use DOIs to uniquely identify papers that they are discussing, while at the same time providing readers with an easy way to go to the site hosting the article, and aggregators such as postgenomic.com can cluster posts about the same paper.
Whilst Googling for reaction, I came across various posts extolling the virtues of OpenURL versus DOIs, or proposing alternative identifiers (DOI or DOH? Proposal for a RESTful unique identifier for papers, and PaperID - An Open Source Identifier for Research Papers). Personally I think much of this discussion focusses on identifiers, when it's the services built on those identifiers that really matter.

Tuesday, February 05, 2008

The Data Wars

Wired 16.01 has an article entitled The Data Wars by Josh McHugh. A quote from the printed version:
They call it scraping — when web companies automatically harvest information from the likes of Yahoo, Google, and craigslist. Now the Internet establishment is clamping down.

It's a sobering read for those of us who advocate harvesting data from as many sources as possible, more so in light of Microsoft's bid to buy Yahoo. Yahoo provides free access to many of its tools via an API (such as the image search I use in iSpecies), and in this sense is much more open than Google. Might this change under Microsoft...?

Monday, February 04, 2008

How to visualize a phylogeny with thousands of tips?

Dave Lunt has a nice post on "How to visualize a phylogeny with thousands of tips?". Dave lists 12 things that his ideal phylogenetic tree viewing tool should do, and invites comments. It will be interesting to see what comes of this...

Incomplete citation and ranking

Came across the paper "Using incomplete citation data for MEDLINE results ranking" (pmid:16779053, full text available in PMC). The authors applied PageRank (the algorithm Google use to rank search results) to papers in MEDLINE and found that PageRank is robust to information loss. In other words, even if a citation database is incomplete it will do a good job of ranking results. This is encouraging, as I'm keen to use this approach to rank both papers and other objects (e.g., sequences and specimens), and will almost certainly never have a complete citation list.
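
For readers unfamiliar with PageRank, here is a minimal power-iteration sketch over a citation graph. It is a textbook version, not the code used in the paper, and the two toy graphs (one with a citation record deleted) use made-up paper labels. It only shows the mechanics of comparing a ranking with and without missing data; it says nothing about how robust the method is on real citation data.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of paper -> list of cited papers."""
    papers = set(citations)
    for cited in citations.values():
        papers.update(cited)
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in papers}
        for paper in papers:
            cited = citations.get(paper, [])
            if cited:
                # Split this paper's rank equally among the papers it cites.
                share = damping * rank[paper] / len(cited)
                for target in cited:
                    new_rank[target] += share
            else:
                # A paper with no recorded citations spreads its rank evenly.
                share = damping * rank[paper] / n
                for target in papers:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy citation graph (made-up paper labels).
full = {"A": ["C"], "B": ["C"], "C": ["D"], "D": [], "E": ["C", "D"]}
# The same graph with one citation record (E cites C) missing.
partial = {"A": ["C"], "B": ["C"], "C": ["D"], "D": [], "E": ["D"]}

for label, graph in (("full", full), ("partial", partial)):
    ranks = pagerank(graph)
    print(label, sorted(ranks, key=ranks.get, reverse=True))
```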

A database of everything

Nothing like a little hubris first thing Monday morning...

After various experiments, such as a triple store for ants (documented on the Semant blog) and bioGUID (documented on the bioGUID blog), I'm starting from scratch and working on a "database of everything". Put another way, I'm working on a database that aggregates metadata about specimens, sequences, literature, images, taxonomic names, etc. But beyond "merely" aggregating data I'm really interested in linking the data. Here are some design decisions I've made so far.

The first is that the database structure used to store metadata uses the Entity–Attribute–Value (EAV) model (for some background see the papers I've bookmarked on Connotea with the tag entity–attribute–value). This tends to send traditional database managers into fits, but for my purposes (storing objects of different kinds with frequently sparse metadata) it seems a natural choice. It's a little bit like a triple store, but I find triple stores frustrating when creating a database. Given the often low quality of the incoming data a lot of post-processing may be needed. Having the data easily accessible makes this task easier -- I view triple stores as "read only" databases. The data I store will be made available as RDF, so users could still use Semantic Web tools to analyse it.
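
To make the EAV idea concrete, here is a minimal sketch using SQLite. The table and column names, and the sample rows, are placeholders chosen for illustration and are not the schema of my database. Every fact is one (entity, attribute, value) row, so objects of different kinds with sparse metadata share a single table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE eav (entity TEXT NOT NULL, attribute TEXT NOT NULL, value TEXT)"
)
conn.execute("CREATE INDEX eav_entity ON eav (entity)")

# Made-up rows: a specimen and a paper carry different attributes.
rows = [
    ("specimen:1", "type", "specimen"),
    ("specimen:1", "latitude", "-27.5"),
    ("specimen:1", "longitude", "153.0"),
    ("paper:1", "type", "publication"),
    ("paper:1", "title", "An example title"),
]
conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", rows)


def load_entity(conn, entity):
    """Reassemble one object from its attribute rows."""
    cur = conn.execute(
        "SELECT attribute, value FROM eav WHERE entity = ?", (entity,)
    )
    return dict(cur.fetchall())


print(load_entity(conn, "specimen:1"))
print(load_entity(conn, "paper:1"))
```

Adding a new kind of object or attribute needs no schema change, which is the attraction; the cost is that values are untyped strings and queries across attributes need self-joins.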

Each object stored in the database gets an identifier based on an MD5 hash of one of its GUIDs (e.g., PubMed identifier, DOI, URL), a technique used by del.icio.us and Connotea (for example, the md5 hash of "http://dx.doi.org/10.2196/jmir.5.4.e27" is d00cf429c001c3c7ae4f2d730718dcc8, hence the Connotea URI for this article is http://www.connotea.org/article/d00cf429c001c3c7ae4f2d730718dcc8). The URIs for the objects in my database will also use these md5 hashes. The original GUIDs (such as DOIs, etc.) are stored within the database.
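
Computing such an identifier takes a couple of lines. In this sketch the function names and the base URI are placeholders, and I am assuming the GUID is hashed exactly as written, as a UTF-8 string with no normalisation; a service that trims or lowercases the string first would produce a different hash.

```python
import hashlib


def object_id(guid):
    """Return the hex MD5 digest of a GUID string, used as a local identifier."""
    return hashlib.md5(guid.encode("utf-8")).hexdigest()


def object_uri(guid, base="http://example.org/object/"):
    # The base URI is a placeholder, not the address of any real service.
    return base + object_id(guid)


print(object_uri("http://dx.doi.org/10.2196/jmir.5.4.e27"))
```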

Much of the programming for this project involves retrieving metadata associated with an identifier. Taking the code I wrote for bioGUID as a starting point, I've added a lot of code to try and clean up the metadata (fixing dates, extracting latitudes and longitudes and suchlike, see Metacrap), but also to extract additional identifiers from text. The database is fundamentally about these links between objects. Typically I write a client for a service that returns metadata for an identifier (typically in XML), transform it to JSON, then post-process it.
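
The XML to JSON to post-processing pipeline can be sketched as follows. The XML is a made-up stand-in for a service response (real services return different structures), the identifiers in it are fake, and the regular expressions are deliberately simple rather than a complete treatment of DOI or PubMed syntax.

```python
import json
import re
import xml.etree.ElementTree as ET

# Made-up XML standing in for a service response; identifiers are fake.
SAMPLE_XML = """
<record>
  <title>An example record</title>
  <date>2008-2-4</date>
  <note>See doi:10.1234/example.5678 and pmid:12345678.</note>
</record>
"""

# Simplified patterns, for illustration only.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
PMID_PATTERN = re.compile(r"\bpmid:\s*(\d+)", re.IGNORECASE)


def xml_to_dict(xml_text):
    """Flatten the children of the root element into a tag -> text dict."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


def post_process(record):
    """Normalise the date and pull identifiers out of free text."""
    cleaned = dict(record)
    parts = cleaned.get("date", "").split("-")
    if len(parts) == 3 and all(p.isdigit() for p in parts):
        cleaned["date"] = "{:04d}-{:02d}-{:02d}".format(*map(int, parts))
    text = " ".join(record.values())
    # Strip trailing punctuation that the greedy DOI pattern may pick up.
    cleaned["dois"] = [d.rstrip(".,;)") for d in DOI_PATTERN.findall(text)]
    cleaned["pmids"] = PMID_PATTERN.findall(text)
    return cleaned


print(json.dumps(post_process(xml_to_dict(SAMPLE_XML)), indent=2))
```

The extracted identifiers are what turn a pile of records into a graph: each one is a candidate link to another object.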

Another decision is that names of genes and taxa are treated as tags, postponing any decision about what the names actually refer to, as well as whether two or more tags refer to the same thing. The expectation is that much of this can be worked out after the fact (in much the same way as I did for the TBMap project).

The database is being populated in two ways, spidering and bulk upload. Spidering involves taking an identifier, finding any linked identifiers, and resolving those. For example, a PubMed identifier may link to a set of nucleotide sequences, many of which may link to a specimen code. Bulk upload involves harvesting large numbers of records and adding them to the database.
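
Spidering is essentially a breadth-first traversal of the identifier graph. In this sketch the LINKS table is a hypothetical stand-in for live lookups against external services, and the identifiers in it are invented.

```python
from collections import deque

# Hypothetical link table standing in for live lookups; identifiers are invented.
LINKS = {
    "pmid:1": ["genbank:AAA", "genbank:BBB"],
    "genbank:AAA": ["specimen:X1"],
    "genbank:BBB": ["specimen:X1", "pmid:1"],
    "specimen:X1": [],
}


def resolve(identifier):
    """Return identifiers linked from this one (stubbed with LINKS here)."""
    return LINKS.get(identifier, [])


def spider(start, max_objects=1000):
    """Breadth-first traversal from one identifier, visiting each object once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < max_objects:
        current = queue.popleft()
        order.append(current)
        for linked in resolve(current):
            if linked not in seen:
                seen.add(linked)
                queue.append(linked)
    return order


print(spider("pmid:1"))
```

The seen set matters because links are often reciprocal (a sequence record points back to its paper), and the max_objects cap keeps one well-connected identifier from pulling in the whole graph.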

The initial focus is on adding records related to TreeBASE, with a view to creating an annotated version of that database such that a user could ask questions like "find me all phylogenies in TreeBASE that include taxa from Australia."