iPhylo: Encylcopedia of Life

Roderic D. M. Page


Sunday, April 25, 2010

Time for some decent service

The BBC web site has an article entitled Giant deep sea jellyfish filmed in Gulf of Mexico which has footage of Stygiomedusa gigantea, and mentions an associated fish, Thalassobathia pelagica.

One thing that frustrates me beyond belief is how hard it is to get more information about these organisms. Put another way, the biodiversity informatics community is missing a huge opportunity here. There are a slew of services, such as Zemanta and OpenCalais.com, that can enrich the content of a document by identifying terms and adding links. Imagine a similar service that took taxonomic names and could provide information and links about that name, so that sites such as the BBC could enrich their pages. We've had various attempts at this¹, but we are still far from creating something genuinely useful.
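As a rough illustration of what such an enrichment service might do at its simplest, here is a minimal sketch that wraps known taxonomic names in links. The `enrich` function, the `NAME_LINKS` lookup table and the example.org URLs are all hypothetical, invented for illustration; a real service would need name recognition far more robust than exact string matching.

```python
import html
import re

# Hypothetical lookup table: taxonomic name -> page about that name.
# The URLs are placeholders, not real services.
NAME_LINKS = {
    "Stygiomedusa gigantea": "https://example.org/taxon/Stygiomedusa_gigantea",
    "Thalassobathia pelagica": "https://example.org/taxon/Thalassobathia_pelagica",
}


def enrich(text, name_links):
    """Return text as HTML, with each known name wrapped in a link."""
    if not name_links:
        return html.escape(text)
    # Try longer names first so a binomial wins over a shorter name it contains.
    names = sorted(name_links, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    parts = []
    pos = 0
    for match in pattern.finditer(text):
        # Escape the plain text between the previous match and this one.
        parts.append(html.escape(text[pos:match.start()]))
        name = match.group(1)
        href = html.escape(name_links[name], quote=True)
        parts.append('<a href="{}">{}</a>'.format(href, html.escape(name)))
        pos = match.end()
    parts.append(html.escape(text[pos:]))
    return "".join(parts)


if __name__ == "__main__":
    sentence = ("footage of Stygiomedusa gigantea, and an associated fish, "
                "Thalassobathia pelagica.")
    print(enrich(sentence, NAME_LINKS))
```

The hard part, of course, is not the markup but having something worth linking to.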

Part of the problem is that the plethora of taxonomic databases we have are often of little use. After fussing with Google I discover that Stygiomedusa gigantea (Browne, 1910) has the synonym Stygiomedusa fabulosa Russell, 1959 (see, e.g., the WoRMS database), but no database tells me that the genus Stygiomedusa was published by Russell in Nature in 1959 (doi:10.1038/1841527a0). Nor can I readily find the original reference for (Browne, 1910) in these databases². Why is this so hard?

Then when we do have information, we fail to make it digestible. For example, the EOL page for Thalassobathia pelagica links to BHL pages, but fails to point out that the pages it links belong to a single article, and that this article (http://biostor.org/reference/4339) is the original description of the fish.

Publishers are increasingly interested in any tools that can embellish their content. The organisation that gets their act together and provides a decent service for publishers (including academic journals, and news services such as the BBC) is going to own this space. Any takers...?

1. Such as uBio LinkIT and EOL NameLink.
2. After finding another taxon with the author Browne 1910 in BHL, I found Diplulmaris (?) gigantea, which looked like a good candidate for the original name for the jellyfish, see http://biodiversitylibrary.org/page/1727009. This is confirmed by the Smithsonian's Antarctic Invertebrates site.

Tuesday, February 02, 2010

EOL, the BBC, and Wikipedia

Last month EOL took the brave step of including Wikipedia content in its pages. I say "brave" because early on EOL was pretty reluctant to embrace Wikipedia on this scale (see the report of the Informatics Advisory Group that I chaired back in 2008), and also because not all of EOL's curators have been thrilled with this development. Partly to assuage their fears, EOL displays Wikipedia-derived content on a yellow background to flag its "unreviewed" status, such as this image of the python genus Leiopython:

It's interesting to compare EOL's approach to Wikipedia with that taken by the BBC, as documented in Case Study: Use of Semantic Web Technologies on the BBC Web Sites. The BBC makes extensive use of content from community-driven external sites such as MusicBrainz and Wikipedia. They embed the content in their own pages, stating where the content came from, but not flagging it as any less meaningful or reliable than the BBC's own content (i.e., no garish yellow background).

Furthermore, the BBC does two clever things. Firstly:

To facilitate integration with the resources external to bbc.co.uk the music site reuses MusicBrainz URL slugs and Wildlife Finder Wikipedia URL slugs. This means that it is relatively straight forward to find equivalent concepts on Wikipedia/DBpedia and Wildlife Finder and, MusicBrainz and /music.

This means that if the identifier for the artist Bat for Lashes in MusicBrainz is http://musicbrainz.org/artist/10000730-525f-4ed5-aaa8-92888f060f5f.html, the BBC reuses the "slug" 10000730-525f-4ed5-aaa8-92888f060f5f and creates a page at http://www.bbc.co.uk/music/artists/10000730-525f-4ed5-aaa8-92888f060f5f. Likewise, if the Wikipedia page for Varanus komodoensis is http://en.wikipedia.org/wiki/Komodo_dragon, then the BBC Wildlife Finder page becomes http://www.bbc.co.uk/nature/species/Komodo_dragon, reusing the slug Komodo_dragon.

Reusing identifiers like this can greatly facilitate linking between databases. I don't need to do a search, or approximate string matching; I just reuse the slug. Note that this works in both directions: it is trivial for MusicBrainz to create links to BBC information, and vice versa. Reusing identifiers isn't new. Other examples include Amazon.com's ASINs (which for books are ISBNs) and BHL's reuse of uBio NameBankIDs -- want literature that mentions the Komodo dragon? Use the uBio NameBankID 2546401 in a BHL URL: http://www.biodiversitylibrary.org/name/2546401.
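To make the point concrete, here is a minimal sketch of slug reuse using the URLs quoted above. The function names are my own, and the URL patterns are simply those of the examples in this post; I am assuming they hold generally, which may not be true for every page.

```python
from urllib.parse import urlparse


def slug_from_url(url):
    """Return the last path segment of a URL, minus any '.html' suffix."""
    last = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    if last.endswith(".html"):
        last = last[:-len(".html")]
    return last


def bbc_music_url(musicbrainz_url):
    # Reuse the MusicBrainz artist identifier as the BBC Music slug.
    return "http://www.bbc.co.uk/music/artists/" + slug_from_url(musicbrainz_url)


def bbc_wildlife_url(wikipedia_url):
    # Reuse the Wikipedia page title as the Wildlife Finder slug.
    return "http://www.bbc.co.uk/nature/species/" + slug_from_url(wikipedia_url)


def bhl_name_url(namebank_id):
    # BHL reuses uBio NameBankIDs in its name URLs.
    return "http://www.biodiversitylibrary.org/name/{}".format(namebank_id)


if __name__ == "__main__":
    print(bbc_music_url(
        "http://musicbrainz.org/artist/10000730-525f-4ed5-aaa8-92888f060f5f.html"))
    print(bbc_wildlife_url("http://en.wikipedia.org/wiki/Komodo_dragon"))
    print(bhl_name_url(2546401))
```

No search and no string matching: the link is a function of the identifier.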

The second clever thing the BBC does is treat the web as a content management system:

BBC Music is underpinned by the Musicbrainz music database and Wikipedia, thereby linking out into the Web as well as improving links within the BBC site. BBC Music takes the approach that the Web itself is its content management system. Our editors directly contribute to Musicbrainz and Wikipedia, and BBC Music will show an aggregated view of this information, put in a BBC context.

Instead of separating BBC and Wikipedia content (and putting the latter in quarantine as does EOL), the BBC embraces Wikipedia, editing Wikipedia content if they feel a page needs improving. One advantage of this approach is that it avoids the need for the BBC to replicate Wikipedia, either in terms of content (the BBC doesn't need to write its own descriptions of what an organism does) or services (the BBC doesn't need to develop tools for people to edit the BBC pages, people use Wikipedia's infrastructure for this). Wikipedia provides core text and identifiers, BBC provides its own unique content and branding.

EOL is trying something different, and perhaps more challenging (at least to do it properly). Given that both EOL and Wikipedia offer text about organisms, there is likely to be overlap (and possibly conflict) between what EOL and Wikipedia say about the same taxon. Furthermore, there will be duplication of information such as bibliographic references. For example, the Wikipedia content included in the EOL page for Leiopython contains a bibliography, which includes these references:

Hubrecht AAW. 1879. Notes III on a new genus and species of Pythonidae from Salawatti. Notes from the Leyden Museum 14-15.

Boulenger GA. 1898. An account of the reptiles and batrachians collected by Dr. L. Loria in British New Guinea. Annali del Museo Civico de Storia Naturale di Genova (2) 18:694-710

The genus name Leiopython was published by Hubrecht (1879), and Boulenger (1898) is cited in support of a claim that a distribution record is erroneous. Hence, these look like useful papers to read. Neither reference on the Wikipedia page is linked to an online version of the article, but both have been scanned by EOL's partner BHL (you can see the articles in BioStor here, and here, respectively)¹.

Problem is, you'd be hard pressed to discover this from the EOL page. The BHL results do list the journal Notes from the Leyden Museum, but you'd have to visit the links manually to discover whether they include Hubrecht (1879) (they do, as well as various occurrences of Leiopython in the indices for the journal). In part this problem is a consequence of the crude way EOL handles bibliographies retrieved from BHL, but it's symptomatic of a broader problem. By simply mashing EOL and Wikipedia content together, EOL is missing an opportunity to make both itself and Wikipedia more useful. Surely it would be helpful to discover what publications cited on Wikipedia pages are in BHL (or in the list of references for hand-curated EOL pages)? This requires genuine integration (for example by reusing existing bibliographic identifiers such as DOIs, and tools such as OpenURL resolvers). If it fails to do this, EOL will resemble crude pre-Web 2.0 mashups where people created web pages that had content from external sites enclosed in <IFRAME> tags.
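As a sketch of what reusing an OpenURL resolver could look like, the snippet below turns the Hubrecht (1879) citation into an OpenURL-style query string. The resolver address is an assumption made for illustration (I have not verified the endpoint or which keys it accepts), the key names follow the common OpenURL 0.1 key/value convention, and the volume number is the corrected one given in the footnote to this post.

```python
from urllib.parse import urlencode

# Assumed resolver endpoint, for illustration only.
RESOLVER = "http://biostor.org/openurl"


def openurl_query(resolver, citation):
    """Build an OpenURL-style query URL from a dict of citation fields."""
    # Drop empty fields so we never send blank parameters.
    params = {key: value for key, value in citation.items() if value}
    return resolver + "?" + urlencode(params)


# Hubrecht 1879, Notes from the Leyden Museum, volume 1, pages 14-15.
hubrecht_1879 = {
    "genre": "article",
    "aulast": "Hubrecht",
    "title": "Notes from the Leyden Museum",
    "volume": "1",
    "spage": "14",
    "date": "1879",
}

if __name__ == "__main__":
    print(openurl_query(RESOLVER, hubrecht_1879))
```

A page that cites a reference could emit a link like this for every entry in its bibliography, leaving it to the resolver to decide whether a scanned copy exists.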

The contrast between the approaches adopted by EOL and the BBC is pretty stark. The BBC has devolved text content to external, community-driven sites that it thinks will do a better job than the BBC could alone. EOL is trying to integrate Wikipedia into its own text content, but without addressing the potentially massive duplication (and, indeed, possible contradictions) that are likely to arise. Perhaps it's time for EOL to be as brave as the BBC, and ask itself whether it is sensible for EOL to try and occupy the same space as Wikipedia.

1. Note that the bibliographic details of both papers are wanting: Hubrecht 1879 is in volume 1 of Notes from the Leyden Museum, and Annali del Museo Civico de Storia Naturale di Genova series 2, volume 18 is also treated as volume 38.

Tuesday, February 26, 2008

Encyclopedia of Life - first impressions

Some thoughts on the first release of the Encyclopedia of Life. I am being deliberately critical. This is a high profile project with tens of millions of dollars in funding, lots of people involved, and is accompanied by some of the most overblown hype in organismal biology. In a sense I think EOL has set itself up by over-promising and under-delivering.

Before continuing, I should point out that I am involved in EOL in an advisory capacity, but not in actually making anything. Some of the tools I've blogged about have made their way into EOL, such as Pygmybrowse and reference parsing (see David Shorthouse's excellent work on this).

Lack of content
I think the first release of EOL should, at a minimum, have provided as much information as I can get from iSpecies and Wikipedia. Other projects, such as Freebase, have pre-populated their databases with content from Wikipedia and other sources. Why didn't EOL? If the argument is that they want authenticated content, then this doesn't wash. Their authenticated content is minimal, and waiting for authentication will, in my view, cripple EOL.

Exemplars are incomplete
The first release contains 25 exemplars. Pages for these taxa

...show the kind of rich environment, with extensive information, to which all the species pages will eventually grow. The information on the exemplar pages has been authenticated (endorsed) by the scientists whose names are listed on these pages.

Well, I hope this isn't the standard EOL aspires to. The pages are incomplete and not interlinked. One of the 25 chosen exemplars is Anolis carolinensis. EOL lists its distribution as:

Widely-distributed throughout the southeastern United States: North Carolina to Key West, Florida, and west to southest Oklahoma and central Texas.

However, the GBIF map EOL displays shows lots of dots in Hawaii:

The EOL account is silent on this interesting distribution pattern. It will come as no surprise that the Wikipedia account of the same species tells us that it has been introduced into Hawaii. Wikipedia 1, EOL 0.

Links

If two pages talk about species that are ecologically associated, then surely those pages should be linked? Among the exemplars is Pissodes strobi, the white pine weevil. In the EOL account, among the hosts listed is Pinus strobus, another exemplar taxon. The accounts of these two taxa are not linked. No hyperlink, nothing. The reader has no idea that there is an exemplar account for Pinus strobus. Furthermore, when reading the account for Pinus strobus there is no indication that it is host to the white pine weevil.
Surely the point of having all this information in one place is so that it can be linked together?

BHL
EOL also exposes some limitations of the Biodiversity Heritage Library. Consider the exemplar page for Pinus strobus L. The "L." indicates that this species was described by Linnaeus. Among the many references listed by BHL, none are by Linnaeus. What gives?

Well, the IPNI record reveals that this species was described on p. 1001 of Species Plantarum. BHL has digitised Species Plantarum, and page 1001 has Pinus strobus:

Now, BHL relies on uBio's tools to extract names, and Linnaeus didn't make this easy (the specific epithet strobus is in the right hand margin, separate from Pinus), but one would have thought that for the exemplar taxa an effort would have been made to link Linnaean names to BHL content -- what better place to showcase the link between a name and its publication? It's quite easy to do, given that IPNI has page numbers for plant names. Just map page numbers to BHL URLs, and you're done.
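A minimal sketch of that mapping, assuming we already have an index from (work, printed page number) to BHL page identifiers. The index, the function name and the page identifier below are placeholders made up for illustration, not real BHL data; only the URL pattern is taken from a BHL page link quoted elsewhere on this blog.

```python
# URL pattern as used by BHL page links (e.g. http://biodiversitylibrary.org/page/1727009).
BHL_PAGE_URL = "http://biodiversitylibrary.org/page/{page_id}"

# Hypothetical index: (work, printed page number) -> BHL page identifier.
# 999999 is a placeholder, NOT the real identifier for this page.
PAGE_INDEX = {
    ("Species Plantarum", 1001): 999999,
}


def bhl_url_for(work, printed_page, page_index):
    """Return the BHL URL for a printed page, or None if it is not indexed."""
    page_id = page_index.get((work, printed_page))
    if page_id is None:
        return None
    return BHL_PAGE_URL.format(page_id=page_id)


if __name__ == "__main__":
    # IPNI gives p. 1001 of Species Plantarum for Pinus strobus.
    print(bhl_url_for("Species Plantarum", 1001, PAGE_INDEX))
```

Building the index is the real work, but it only has to be done once per scanned volume.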

Inconsistency
Going down the taxonomic hierarchy, weird things happen. When viewing the plant genus Morus I can see a picture of Morus nigra (presumably this is "authenticated" content). If I drill down to the species Morus nigra, I'm told there is no authenticated content for this species. Either the image is Morus nigra or it isn't. If it is, why not show it? If it isn't, why claim that it is?

Logos

Way too much space is devoted to logos of various contributors, BHL being the worst offender (it doesn't help that the BHL content is incomplete, lacking links for Linnaean names). I don't care about logos. Contributors may care about getting their logos displayed, but users couldn't care less. They get in the way. On some pages, there's more screen space devoted to logos than information (e.g., the page for Apomys datae). This is, frankly, ridiculous, and reflects a warped set of priorities.

What's worse, all these logos are associated with links that take people away from EOL. Hence EOL becomes little more than a collection of web links to other sites.

Search
The search is based on the Catalogue of Life, and inherits the same problems. For example, if I search for "Morus" I get a list in alphabetical order of taxonomic names that contain the string "morus". The two names that are an exact match occur as items three and four on the list -- they should be first and second.

It gets worse if I search on "Tyrannosaurus rex". EOL doesn't do dinosaurs, and so doesn't contain anything on T. rex, but the search results tell me that "The following 116 search results contain 'Tyrannosaurus rex'". Nope, none of them do.

The search engine is poorly done: it fails to rank results sensibly, incorrectly reports what it does find, and has no support for spelling mistakes.
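For what it's worth, none of this is hard. Below is a minimal sketch, not EOL's or the Catalogue of Life's actual code, that ranks exact matches first, reports only names that really contain the query, and falls back to spelling suggestions using Python's standard `difflib`. The list of names is a small illustrative sample I chose, not real search output.

```python
import difflib


def search(query, names):
    """Return (hits, suggestions) for a taxonomic name query.

    hits: names containing the query, exact matches first, then names
          starting with the query, then the rest (alphabetical within tiers).
    suggestions: close spellings, offered only when there are no hits.
    """
    q = query.strip().lower()
    hits = [name for name in names if q in name.lower()]

    def rank(name):
        lowered = name.lower()
        if lowered == q:
            tier = 0
        elif lowered.startswith(q):
            tier = 1
        else:
            tier = 2
        return (tier, lowered)

    hits.sort(key=rank)

    suggestions = []
    if not hits:
        by_lowercase = {name.lower(): name for name in names}
        close = difflib.get_close_matches(q, list(by_lowercase), n=3, cutoff=0.8)
        suggestions = [by_lowercase[c] for c in close]
    return hits, suggestions


# Illustrative sample of names, in alphabetical order.
SAMPLE = ["Diomorus", "Ficus sycomorus", "Morus", "Morus alba", "Morus nigra"]

if __name__ == "__main__":
    print(search("Morus", SAMPLE))
    print(search("Tyrannosaurus rex", SAMPLE))
    print(search("Morus nigre", SAMPLE))
```

In this sample, the exact match "Morus" comes first rather than third, a query with no matches returns no hits instead of a list of unrelated names, and a misspelt query gets a suggestion.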

Authenticated content
This is probably the thing that, if left as it is, will strangle EOL. The insistence on "authenticated (endorsed)" content places a severe brake on what EOL can offer.

It's a web site
EOL's web site has no mechanism for people to extract data (e.g., RSS feeds, microformats, links to RDF, etc.). It's intended to be read by humans, not machines. This greatly diminishes its utility.

So, I've got that off my chest. The first release was always going to be a disappointment, especially given the hype. What frustrates me, however, is just how far the first release is from what it could have been.

The real question is how much the issues I've raised are things which are easy to fix given time, or whether they reflect underlying problems with the way the project is conceived.

EOL live

The first release of the Encyclopedia of Life is officially live today. I have promised to be very good...

Wednesday, May 09, 2007

Encyclopedia of Life Launch

The Encyclopedia of Life web site is up, together with some rather breathless publicity and this cool movie. Of course, it's all vapourware just now. I'm involved in some of the informatics in an advisory role. It will be interesting to see what happens. Let's hope that the fate of EoL will be different to that of the similarly ambitious All Species. Oh, and then there's SpeciesBase...

For some reaction see Slashdot.