Thursday, May 29, 2008

When DOIs collide and then disappear: when is a unique, resolvable identifier a bad idea?


As much as I like the idea of a globally unique, resolvable identifier, my recent experience with JSTOR is making me wonder.

JSTOR has three identifiers for articles it archives: DOIs, SICIs, and stable URLs (the latter introduced with the new platform released April 4, 2008). Previously JSTOR would publish DOIs for many of its articles. However, not all of these work, and many are now embedded in the HTML (say, in Dublin Core meta elements) but not publicly displayed.

I suspect the issue is the moving wall:
Journals in JSTOR have "moving walls" that define the time lag between the most current issue published and the content available in JSTOR. The majority of journals in the archive have moving walls of between 3 and 5 years, but publishers may elect walls anywhere from zero to 10 years.
Now, imagine that a publisher has an article on its web site, complete with a DOI, and that article is then added to JSTOR, but is still displayed on the publisher's site.


To make this concrete, consider the article by Baum et al. On the InformaWorld site this is displayed with doi:10.1080/106351598260879. The same article is also in JSTOR, with the URL http://www.jstor.org/pss/2585367. No DOI is displayed on the page, but if you look at the HTML source, you find:
<meta name="dc.Identifier" scheme="doi" content="10.2307/2585367">. The DOI prefix 10.2307 is used for all JSTOR DOIs, and some for Systematic Biology still work, e.g. 10.2307/2413524.
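Extracting that hidden DOI is easy to script. Here's a minimal sketch in Python (whether JSTOR serves the meta element to a script, rather than only to a browser, is an assumption on my part):

import urllib.request
from html.parser import HTMLParser

class DOIMetaParser(HTMLParser):
    """Collect content values from <meta name="dc.Identifier" scheme="doi" ...> elements."""
    def __init__(self):
        super().__init__()
        self.dois = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        if (a.get("name") or "").lower() == "dc.identifier" and (a.get("scheme") or "").lower() == "doi":
            self.dois.append(a.get("content"))

html = urllib.request.urlopen("http://www.jstor.org/pss/2585367").read().decode("utf-8", "ignore")
parser = DOIMetaParser()
parser.feed(html)
print(parser.dois)  # expect ['10.2307/2585367'] if the meta element is present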

Now, what happens when the JSTOR moving wall overlaps with the publisher's material? What happens if a publisher digitises back issues, then assigns them DOIs? Do the JSTOR DOIs then die (as some of them seem to have already done)? And what happens to the poor sap like me, who has been linking to JSTOR DOIs in the naive belief that DOIs don't die?

Suddenly separating identity from resolution is starting to look very attractive...

Friday, May 23, 2008

QOTD

Setting all these reservations and biases aside, the total number of living organisms that have received Latin binomial names is currently around 1.5 million or so. Amazingly, there is as yet no centralized computer index of these recorded species. It says a lot about intellectual fashions, and about our values, that we have a computerized catalog entry, along with many details, for each of several million books in the Library of Congress but no such catalog for the living species we share our world with. Such a catalog, with appropriately coded information about the habitat, geographical distribution, and characteristic abundance of the species in question (no matter how rough or impressionistic), would cost orders of magnitude less money than sequencing the human genome; I do not believe such a project is orders of magnitude less important. Without such a factual catalog, it is hard to unravel the patterns and processes that determine the biotic diversity of our planet.

--Robert M. May, 1988, "How many species are there on Earth?" doi:10.1126/science.241.4872.1441


Not much has changed in twenty years...

BioOne (and/or CrossRef) sucks


<rant>
BioOne sucks. Really, really, sucks. I have lost count of the number of times they break DOIs. These are supposed to be the gold standard globally unique identifier, and BioOne continually buggers them. For example, take this URL:

http://www.bioone.org/perlserv/?request=get-abstract&doi=10.1600/02-14.1.

Note the doi=10.1600/02-14.1 bit at the end. If we go to the web page, we see this DOI displayed at the bottom of the page. Yet, when we resolve the DOI, we get the dreaded DOI Not Found error.

This.should.not.happen.

What is BioOne doing!? Now, it is possible that there's a problem with CrossRef, because Googling this paper I found it also lives on Ingenta with, wait for it, another DOI (doi:10.1600/036364404772973960).

This.should.not.happen.

BioOne and Ingenta are hosting the same paper, with different DOIs, only one of which is working. Will somebody please bang some heads together and sort this out!
</rant>
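Checking whether a DOI actually resolves is easy to script, which makes this kind of breakage all the more annoying. Here's a minimal sketch in Python that simply asks the doi.org proxy about each DOI and looks at the HTTP status (a redirect means the DOI is registered, 404 is the dreaded "DOI Not Found" page):

import http.client
import urllib.parse

def doi_resolves(doi):
    """HEAD the DOI proxy without following redirects: 3xx = registered, 404 = not found."""
    conn = http.client.HTTPSConnection("doi.org")
    conn.request("HEAD", "/" + urllib.parse.quote(doi))
    status = conn.getresponse().status
    conn.close()
    return 300 <= status < 400

for doi in ["10.1600/02-14.1", "10.1600/036364404772973960"]:
    print(doi, "resolves" if doi_resolves(doi) else "does NOT resolve")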

Thursday, May 22, 2008

iPhylo on Google Code

Partly inspired by Pedro Beltrão's post Open Science project on domain family expansion about using Google Code as a project management system, I've started to populate the iPhylo project. At this stage I'm uploading some scripts for parsing and extracting bibliographic records, and adding wiki pages describing how this is done, discussing different bibliographic identifiers, etc. The aim is to slowly document the background to all the harvesting and linking that I'm working on. Hence, the Google Code project will have documentation and data, not just code. The code for the web site won't go in for a while yet; it needs massive cleaning and tidying up.

Wednesday, May 21, 2008

TAXACOM indexed by MarkMail


MarkMail is a great tool for searching mail archives. Although MarkMail focusses on software development projects, its developers are open to requests, so last week I asked if they could index TAXACOM. My pitch was that TAXACOM is a long-running list full of interesting conversations, has been the subject of scholarly study (Christine Hine's book I mentioned earlier), and is topical given interest in biodiversity and the Encyclopedia of Life.

They've done so, and the result is http://taxacom.markmail.org/ . It's just fabulous. For example, here are some things you can do. Being terribly egocentric, I looked at my own posts:
http://taxacom.markmail.org/search/?q=from:"Roderic+Page".

Realising that there are other people in the world, I queried for posts about GBIF:
http://taxacom.markmail.org/search/?q=gbif.

A broader theme might be databases, or the taxonomic impediment. Below is the chart of messages over time for the query "databases". You can "swipe" the mouse across the chart to select messages from a given time span.



I worry that I may end up spending more time playing with this than I should, but it's a neat tool. Hats off to Jason Hunter of MarkMail for adding TAXACOM so promptly.

Tuesday, May 13, 2008

JavaScript Information Visualization Toolkit


I've just discovered Nicolas Garcia Belmonte's JavaScript Information Visualization Toolkit (JIT). Wow! This is very cool stuff (and no Flash). To quote from the web site:
The JIT is an advanced JavaScript infovis toolkit based on 5 papers about different information visualization techniques.
The JIT implements advanced features of information visualization like Treemaps (with the slice and dice and squarified methods), an adapted visualization of trees based on the Spacetree, a focus+context technique to plot Hyperbolic Trees, and a radial layout of trees with advanced animations (RGraph).

Nicolas also links to a talk by Tamara Munzner, which I've embedded below to remind myself to watch it.

Thursday, May 08, 2008

Open Access logo - help


Trivial as this may seem, I'm trying to find out who designed this "Open Access" logo, and whether there are original files for it. I've seen this logo (or variations on it) on the PLoS web site and on the site of the open access publisher Hindawi Publishing, and the Mac OS X program Papers uses it too.

It's driving me nuts that I can't find the original. Other widely used logos typically have a site where a designer or organisation provides a bunch of versions in different formats: the Creative Commons symbols, the ubiquitous RSS feed icon, and the Geotag icon, for example. It's often desirable to have an icon in several formats, ideally including a vector-based version (e.g., EPS or SVG) that can be used to create images at different resolutions, and these projects provide such files.

Apart from the interesting fact that there doesn't seem to be a standard logo or symbol for Open Access, does anybody know where this logo came from?

Fixing GBIF

The more I play with GBIF, the more spectacular errors I come across. Here's one small example of what can go wrong, and how easy it would be to fix at least some of the errors in GBIF. This is topical given that the recent review of EOL highlighted the importance of vetting and cleaning data.

The frog Boophis periegetes features in a recent study of DNA barcoding (doi:10.1186/1742-9994-2-5). The sequences from this study (AY848605-9) aren't georeferenced in GenBank, but in iPhylo they are, courtesy of MetaCarta's web services. The sequences are located in Madagascar.

Finding errors

Curious about the frog I did a search in iSpecies and got the following map:


Oops, the frog is found in the middle of the South Atlantic(!), and in Brazil(!?).
These specimen records are provided by the MCZ, Harvard. Looking at the latitude and longitude co-ordinates, it's clear that there has been a comedy of errors. In the case of MCZ A-119852 the longitude is west instead of east, for MCZ A-119850 and MCZ A-119851 the latitude and longitudes have been swapped, and the longitude is west instead of east (again). If we make these changes, the specimens go back to Madagascar (the rectangle on the SVG map below). If you don't see the map, use a decent web browser such as Safari 3 or Firefox 2. If you must use Internet Explorer, grab the RENESIS player.


[SVG map: the corrected specimen localities fall within a bounding rectangle over Madagascar]

Interestingly the DiGIR records all list the country as Madagascar, so for any specimen in GBIF it would be trivial to test:
  1. whether the co-ordinates for the specimen fall inside the bounding box for the country;
  2. if not, whether they do after changing sign (i.e., hemisphere) and/or swapping latitude and longitude.

These would be trivial things to do, and would probably catch a lot of the more egregious errors in GBIF.
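Here's a minimal sketch of such a check in Python; the bounding box for Madagascar is approximate, and the test point is a hypothetical record with its longitude recorded as west instead of east:

# Approximate bounding box for Madagascar: (min_lat, max_lat, min_lon, max_lon)
MADAGASCAR = (-26.0, -11.5, 43.0, 51.0)

def in_box(lat, lon, box):
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

def suggest_fix(lat, lon, box):
    """Return the first sign flip and/or lat-lon swap that puts the point
    inside the country's bounding box, or None if nothing simple works."""
    candidates = []
    for a, b in ((lat, lon), (lon, lat)):    # as recorded, then swapped
        for sign_a in (1, -1):
            for sign_b in (1, -1):           # all four sign (hemisphere) combinations
                candidates.append((sign_a * a, sign_b * b))
    for cand in candidates:
        if in_box(cand[0], cand[1], box):
            return cand
    return None

print(suggest_fix(-15.0, -48.0, MADAGASCAR))  # hypothetical record: prints (-15.0, 48.0)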

Fixing errors
What will be interesting is whether these records will be fixed. I have sent feedback via GBIF's web site, as well as sending an email to the MCZ. I'll let readers know what happens.

Ground truth

Lastly, those interested in the frog itself may find the iSpecies search frustrating as the link returned by Google Scholar leads to a page in Ingenta saying:
This title is now published by Blackwell Publishing and can be found here www.ingentaconnect.com/content/bsc/zoj

Nope, the paper in question is actually at ScienceDirect (doi:10.1006/zjls.1995.0040). This paper describes the species, and gives the latitude and longitude of the collection localities (correctly).

Monday, May 05, 2008

iPhylo Demo

I started this blog with the goal of documenting my own efforts to make a database of evolutionary trees, based on ideas sketched in hdl:10.1038/npre.2007.1028.1. I've felt that the major task is to link phylogenies to other information, such as taxon names, specimens, localities, images, publications, etc. That is, to embed trees in a broader context. Discovering how to engage with that broader context led to a bunch of experiments, toys, and diversions:
  1. iSpecies, a toy to aggregate information on a species.
  2. Semant, experiments with RDF and triple stores (AKA the Semantic Web).
  3. bioGUID, an attempt to make identifiers resolvable, with an increasing focus on developing an OpenURL resolver for biodiversity literature.

iSpecies and bioGUID are still operational, but the ant work fell victim to server crashes, and a growing frustration with the limitations of triple stores. Blogs for all three projects document their histories: iSpecies, Semant, and bioGUID. In a sense, these blogs document the steps along the way to iPhylo.

Based on this experience, I started again with what I've previously referred to as a database of everything. The first public demo is online at iphylo.org/~rpage/demo1. It's very crude, but may give a sense of what I'm trying to do.

The goal of iPhylo is to treat biodiversity objects as equal citizens. Each object has a unique identifier, associated metadata, and is linked to other objects (for example, a specimen is linked to sequences, sequences are linked to publications, etc.). By following the links it is possible to generate new views on existing information, such as a map for a study that doesn't include one. For example, below is a map generated for Brady et al. (doi:10.1073/pnas.0605858103), based on links between sequences and specimens (if you can't see the map you need an SVG-capable web browser, such as Safari 3 or Firefox 2).


[SVG map generated for Brady et al. from the links between sequences and specimens]

Ironically there are no phylogenies yet. At this stage I'm trying to link the bits together.

How does it work?
More on this later. Briefly, iPhylo uses an entity-attribute-value (EAV) database to store objects and their relationships. Like bioGUID, iPhylo relies on a suite of web services (most external, some I've developed locally) to locate and resolve identifiers. iPhylo resolves identifiers for PubMed records, GenBank sequences, museum specimens, publications, etc., and adds the associated metadata to a local database. Wherever possible it resolves any links in the metadata (e.g., if a GenBank record mentions a specimen, iPhylo will try to retrieve information on that specimen). When you view an object in iPhylo, these links are displayed. iPhylo will also try to convert bibliographic records to identifiers (such as DOIs) if none are provided, and extracts georeferences for specimens and sequences, either from the original records or by using a georeferencing service. Taxonomic names are resolved using uBio, and are treated as "tags."
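This isn't the actual iPhylo schema, but as a minimal sketch of the entity-attribute-value idea (with purely hypothetical identifiers), something like the following, here using SQLite from Python:

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE object (id TEXT PRIMARY KEY, type TEXT);
    CREATE TABLE link   (subject TEXT, predicate TEXT, object TEXT);
""")

# Hypothetical identifiers, purely for illustration
db.executemany("INSERT INTO object VALUES (?, ?)", [
    ("specimen:XYZ-1", "specimen"),
    ("sequence:ABC123", "sequence"),
    ("doi:10.9999/example", "publication"),
])
db.executemany("INSERT INTO link VALUES (?, ?, ?)", [
    ("sequence:ABC123", "voucher", "specimen:XYZ-1"),
    ("sequence:ABC123", "publishedIn", "doi:10.9999/example"),
])

# Follow the links: from a publication to its sequences, then on to their voucher specimens
rows = db.execute("""
    SELECT cites.object FROM link pub JOIN link cites ON pub.subject = cites.subject
    WHERE pub.predicate = 'publishedIn' AND pub.object = ? AND cites.predicate = 'voucher'
""", ("doi:10.9999/example",)).fetchall()
print(rows)  # [('specimen:XYZ-1',)]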

At present iPhylo is being populated by various scripts; there is no facility for users to add data. This is something I will add in the future.

Getting the data
One of the biggest challenges is getting data (or, to be more precise, figuring out how to harvest available data). iPhylo builds on code for bioGUID. I've also been exploring bulk harvesting of data sources. Sometimes this is easy. Many sequences in GenBank are linked to records in PubMed, so if you know the PubMed id for a paper, you can harvest its sequences. For example, even though the Bulletin of the American Museum of Natural History isn't indexed in PubMed, it is possible to retrieve all the sequences that GenBank records as coming from papers published in the Bulletin. You can retrieve this list in XML form by clicking here.
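bioGUID has its own harvesting code; purely as a sketch of the idea, here's how to get the list of linked sequences for a single paper using NCBI's E-utilities (the PubMed id below is a placeholder):

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

def sequences_for_pubmed(pmid):
    """Use NCBI elink to list nucleotide (nuccore) ids linked to a PubMed record."""
    params = urllib.parse.urlencode({"dbfrom": "pubmed", "db": "nuccore", "id": pmid})
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?" + params
    tree = ET.fromstring(urllib.request.urlopen(url).read())
    return [link.text for link in tree.findall(".//LinkSetDb/Link/Id")]

print(sequences_for_pubmed("123456"))  # replace with a real PubMed id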


Why?

There are all sorts of things which could be done with this. For example, by linking objects together we can also track the provenance of data, and ultimately build "citation networks" of specimens, sequences, etc. For background see my paper on "Biodiversity informatics: the challenge of linking data and the role of shared identifiers" (doi:10.1093/bib/bbn022, preprint at hdl:10101/npre.2008.1760.1).

As mentioned above, we can generate maps even if the original study didn't include one (by following the links). Given that we can geotag studies, this opens up the possibility of querying studies spatially. For example, this study on bats and this study on rodents deal with taxa with very similar distributions. A spatial query could find these easily. Imagine being interested in, say, Madagascar, and being able to find relevant phylogenetic studies, even if the title and abstract of the paper don't mention Madagascar by name.

There is also potential to clean data. One of the first studies I uploaded is Grant et al.'s study of dart-poison frogs. The map for this study shows an outlying point in California:


[SVG map for the Grant et al. dart-poison frog study, showing an outlying point in California]

This point is MVZ 199828, which is a specimen of the salamander Aneides flavipunctatus. In GenBank, MVZ 199828 is listed as the voucher for seven sequences from the frog Mannophryne trinitatis. Oops. A quick iSpecies search, and a click on the GBIF map, reveals that there is a specimen MVZ 199838 of Mannophryne trinitatis. I suspect that this is the true voucher for these sequences, and that the GenBank records contain a typo.

Future
This is all still very, very crude. The demo is slow, the queries aren't particularly clever, and I've probably gone overboard on using JavaScript to populate the web pages. The real value isn't in the web pages, but in the links between the data objects. This is my main focus -- extracting and adding links. For now the data is displayed but you can't edit it. However, this is coming. Very basic RDF and RSS feeds are available for each object, fans of microformats will find some goodies, and sociologists of science may find some of the coauthorship graphs intriguing.

Saturday, May 03, 2008

Colossal squid

The dissection of the colossal squid (Mesonychoteuthis hamiltoni) specimen from Antarctica has been getting a lot of coverage. Pangs of homesickness, especially seeing Steve O'Shea enthusing about the beast. Steve was a contemporary at Auckland Uni when I was a student. I remember him being deeply disappointed in me because I moved away from doing alpha taxonomy of crustaceans (describing taxa such as Pinnotheres atrinicola, right) to fussing with cladistics and computers. Looking at Steve on YouTube, I think it's clear who's having more fun. Maybe I should have stuck to taxonomy after all...

Thursday, May 01, 2008

EOL Review

Last month I was at the MBL in Woods Hole, taking part in the review of the Biodiversity Informatics Group. BIG is responsible for the EOL web site. I chair the Informatics Advisory Group, which provides advice to BIG, and it was our task to produce an evaluation of where things stood. I've written a post on the Encyclopaedia of Life blog about some of the big challenges facing EOL as it moves into its second year.

Wednesday, April 30, 2008

Paper published


Bit of a rarity these days. My paper on identifiers in biodiversity informatics, which I mentioned earlier when I deposited the preprint at Nature Precedings, has been published in Briefings in Bioinformatics (doi:10.1093/bib/bbn022).

Here's the abstract:
A major challenge facing biodiversity informatics is integrating data stored in widely distributed databases. Initial efforts have relied on taxonomic names as the shared identifier linking records in different databases. However, taxonomic names have limitations as identifiers, being neither stable nor globally unique, and the pace of molecular taxonomic and phylogenetic research means that a lot of information in public sequence databases is not linked to formal taxonomic names. This review explores the use of other identifiers, such as specimen codes and GenBank accession numbers, to link otherwise disconnected facts in different databases. The structure of these links can also be exploited using the PageRank algorithm to rank the results of searches on biodiversity databases. The key to rich integration is a commitment to deploy and reuse globally unique, shared identifiers [such as Digital Object Identifiers (DOIs) and Life Science Identifiers (LSIDs)], and the implementation of services that link those identifiers.

Monday, April 28, 2008

Google Code wiki using Subversion

For some time now Google Code has been displaying the message:

The web interface for wiki content is currently READ-ONLY for maintenance.
You may still add comments, and members may add, edit, or delete wiki pages via svn. Learn more.

This is a bit of a pain as I've recently put the code for my LSID tester into Google Code (the project is here). Since having a simple wiki is part of the attraction of Google Code, I decided to finally figure out how to add a wiki via Subversion. It turns out to be pretty straightforward. I created a folder called "wiki" and added a file with some wiki markup. I then added it to the repository:
svn import -m "Trying to add wiki" wiki https://lsid-php.googlecode.com/svn/wiki/ --username USERNAME

(do this from the folder containing "wiki", not within the "wiki" folder itself). This adds the contents of the wiki folder to the Google Code repository. You can then check this out:
svn checkout https://lsid-php.googlecode.com/svn/wiki/ lsid-php-wiki --username USERNAME

This probably seems obvious to many, but I'm used to CVS, having run a CVS repository since the late 1990s when Mike Charleston and I were working on TreeMap. I've been resisting moving to Subversion simply because of the hassle of learning stuff that doesn't actually make my life any easier. That said, Google Code is a nice way to host projects.

Thursday, April 03, 2008

Biodiversity informatics: the challenge of linking data and the role of shared identifiers

The manuscript for Briefings in Bioinformatics that I alluded to earlier has been accepted for publication. I've put a preprint up at Nature Precedings (hdl:10101/npre.2008.1760.1). The final version will appear in print later this year.

Thursday, March 20, 2008

Phylowidget


Greg Jordan and Bill Piel have released PhyloWidget, a Java applet for viewing phylogenetic trees. It's very slick, with some nice visual effects courtesy of Processing.
PhyloWidget is open source, with code hosted on Google Code. I'm a C++ luddite, so it took me a few moments to figure out how to build the applet, but it's simple enough; just type
ant PhyloWidget
at the command prompt. I got a couple of warnings about missing .keystore files (something to do with signing the applet), but otherwise things seemed to work.
The applet has a URL API, which makes it easy to view trees. For example, try this link to view the Frost et al. amphibian tree (doi:10.1206/0003-0090(2006)297[0001:TATOL]2.0.CO;2).

Systematics as Cyberscience


Vince Smith alerted me to "Systematics as Cyberscience", by Christine Hine, whose work I've mentioned earlier. Looks like an interesting read. From the publisher's blurb:
The use of information and communication technology in scientific research has been hailed as the means to a new larger-scale, more efficient, and cost-effective science. But although scientists increasingly use computers in their work and institutions have made massive investments in technology, we still have little idea how computing affects the way scientists work and the kind of knowledge they produce. In Systematics as Cyberscience, Christine Hine explores these questions by examining the developing use of information and communication technology in one discipline, systematics (which focuses on the classification and naming of organisms and exploration of evolutionary relationships). Her sociological study of the ways that biologists working in this field have engaged with new technology is an account of how one of the oldest branches of science transformed itself into one of the newest and became a cyberscience.

Monday, March 10, 2008

Google's Social Graph API

Google's Social Graph API was released earlier this year.

The motivation:
With so many websites to join, users must decide where to invest significant time in adding their same connections over and over. For developers, this means it is difficult to build successful web applications that hinge upon a critical mass of users for content and interaction. With the Social Graph API, developers can now utilize public connections their users have already created in other web services. It makes information about public connections between people easily available and useful.
Apart from the obvious application to scientific databases (for example, utilising connections such as co-authorship), imagine the same idea applied to data.

Sunday, March 09, 2008

CrossRef adds more information to OpenURL resolver

Tom Pasley recently drew my attention to CrossRef's addition of an XML format parameter to their OpenURL resolver. Adding &format=xml to the OpenURL request retrieves bibliographic metadata in "unixref" format (for those who like this sort of thing, the XML schema is here). The biggest change is that the metadata now lists more than one author for multi-author papers.

I tend to use JSON for my work now, so a common task is converting XML data streams into JSON. I've modified my bioGUID OpenURL resolver to make use of the unixref format, which meant I had to write an XSLT file to convert unixref to JSON. If you're interested, you can grab a copy here. It's not pretty, but it seems to work OK.
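For anyone who wants to experiment with the same transformation, applying a stylesheet takes only a few lines in, say, Python with lxml (the file names below are placeholders for a saved unixref record and the XSLT linked above):

from lxml import etree

# Placeholders: a unixref record saved from the OpenURL resolver, and an XSLT that emits JSON
doc = etree.parse("record.unixref.xml")
transform = etree.XSLT(etree.parse("unixref2json.xsl"))
print(str(transform(doc)))  # lxml implements XSLT 1.0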

For some years now I've relied on Marc Liyanage's excellent tool TestXSLT to develop XSLT files. If you have a Mac and work with XSLT, then do yourself a favour and grab a copy of this free tool.

Thursday, March 06, 2008

PageRank for biodiversity

This will probably tempt fate, but I have an invited manuscript in review for Briefings in Bioinformatics on the topic of identifiers in biodiversity informatics. Readers of this blog will find much of it familiar (DOIs, LSIDs, etc.). For fun I constructed a graph for three ant specimens of Probolomyrmex tani, and the images, DNA sequences, and publications that link to these specimens.

Based on this graph I computed the PageRank of each specimen. The motivation for this exercise is that AntWeb lists 43 specimens of this species, in alphabetical order. This is arbitrary. What if we could order them by their "importance"? One way to do this is based on how many times the specimens have been sequenced, photographed, or cited in scientific papers. This gives us a metric for ordering lists of specimens, as well as demonstrating the "value" of a collection (based on people actually using it in their work). I think there is considerable scope for applying PageRank-like ideas to questions in biodiversity informatics. Robert Huber has an intriguing post on TaxonRank that explores this idea further.
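The graph in the manuscript was assembled by hand, but computing this sort of ranking is straightforward; here's a sketch using networkx with made-up node names (an edge x -> y means x cites or uses y):

import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ("paper:A", "specimen:1"),
    ("paper:B", "specimen:1"),
    ("sequence:X", "specimen:1"),
    ("image:I", "specimen:2"),
    ("paper:A", "specimen:3"),
])

ranks = nx.pagerank(g)
specimens = sorted((n for n in g if n.startswith("specimen:")),
                   key=lambda n: ranks[n], reverse=True)
for n in specimens:
    print(n, round(ranks[n], 3))  # specimen:1, with three incoming links, ranks first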

Word for the day - "transclusion"

Stumbled across Project Xanadu, Ted Nelson's vision of the way the web should be (e.g., BACK TO THE FUTURE: Hypertext the Way It Used To Be). Nelson coined the term "transclusion": including one document inside another by reference. The screen shot of Xanadu Space may help illustrate the idea:

Nelson envisages a web where instead of just one-way links, documents include parts of other documents, and one can view a document side-by-side with the source documents. Modern web browsers transclude images (the image file is not "physically" in the document, rather it exists elsewhere), but mostly they link to other documents via hyperlinks.
Ted Nelson's writings are a fascinating read, partly because they remind you just how much of the web we take for granted, and how things could be different (better?). One thing he objects to is that much of the web simulates paper:
Much of the field has imitated paper: in word processing (Microsoft Word and Adobe Acrobat) and the World Wide Web, whose rectangular page layouts become a focal issue. It should be noted that these systems imitate paper under glass, since you can't annotate it.
Nelson also advocates every element of a document having its own unique address, not just at the book or article level. This resonates with what is happening with digital libraries. Gregory Crane in his article "What Do You Do with a Million Books?" (doi:10.1045/march2006-crane) notes that:
Most digital libraries still mimic their print predecessors, treating individual objects – commonly chunks of PDF, RTF/Word, or HTML with no standard internal structure – as its constituent units. As digital libraries mature and become better able to extract information (e.g., personal and place names), each word and automatically identifiable chunk of words becomes a discrete object. In a sample 300 volume, 55 million word collection of nineteenth-century American English, automatic named entity identification has added 12,000,000 tags. While this collection focuses on name rich historical materials and includes several reference works, this system already discovers thousands of references to named entities in most book length documents. We thus move from single catalogue entries with a few hundred words to thousands of tagged objects – an increase of at least one order of magnitude with named entities and of at least two orders of magnitude when we consider each individual word as an object.
I discovered Crane's paper via Chris Freeland's post On Name Finding in the BHL. Chris summarises BHL's work on scanning biodiversity literature and extracting taxonomic names. BHL's output is at the level of pages, rather than articles. Existing GUIDs for literature (such as DOIs and SICIs) typically identify articles rather than pages (or page elements), so there's a need to extend these to pages.

Chris also raises the issue of ranking and relevance -- "What do you do with 19,000 pages containing Hymenoptera?". One possibility is to explore Robert Huber's TaxonRank idea (inspired by Google's PageRank). This would require text mining to build synonymy lists from scanned papers, which is challenging but not impossible. But I suspect that the network of citations is what will help build a sensible way to rank those 19,000 pages.

A while ago people were speculating what Google could do to help biodiversity informatics. I found much of this discussion to be vague, with no clear notion of what Google could actually do. What I think Google is exceptionally good at are two things we need to tackle -- text mining, and extracting information from links. I think this is where BHL and, by extension, EOL, should be devoting much of their resources.