Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed. ISSN 2051-8188. Written content on this site is licensed under a Creative Commons Attribution 4.0 International license.
As much as I like the idea of a globally unique, resolvable identifier, my recent experience with JSTOR is making me wonder.
JSTOR has three identifiers for the articles it archives: DOIs, SICIs, and stable URLs (the latter being introduced with the new platform released April 4, 2008). Previously JSTOR would publish DOIs for many of its articles. However, not all of these work, and many are now embedded in the HTML (say, in Dublin Core meta elements) but not publicly displayed.
Journals in JSTOR have "moving walls" that define the time lag between the most current issue published and the content available in JSTOR. The majority of journals in the archive have moving walls of between 3 and 5 years, but publishers may elect walls anywhere from zero to 10 years.
Now, imagine that a publisher has an article on its web site, complete with a DOI, and that article is then added to JSTOR, but is still displayed on the publisher's site.
To make this concrete, consider the article by Baum et al. On the InformaWorld site this is displayed with doi:10.1080/106351598260879. The same article is also in JSTOR, with the URL http://www.jstor.org/pss/2585367. No DOI is displayed on the page, but if you look at the HTML source, you find: <meta name="dc.Identifier" scheme="doi" content="10.2307/2585367">. The DOI prefix 10.2307 is used for all JSTOR DOIs, and some for Systematic Biology still work, e.g. 10.2307/2413524.
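For anyone wanting to dig these hidden DOIs out in bulk, here is a minimal sketch in Python using only the standard library. It assumes the page carries a meta element shaped like the one quoted above; the function and class names are made up for illustration, and the sample page is a stub rather than real JSTOR HTML.

```python
from html.parser import HTMLParser


class DublinCoreDoiParser(HTMLParser):
    """Collect DOIs from <meta name="dc.Identifier" scheme="doi" ...> elements."""

    def __init__(self):
        super().__init__()
        self.dois = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None if absent.
        attributes = {name: (value or "") for name, value in attrs}
        if (attributes.get("name", "").lower() == "dc.identifier"
                and attributes.get("scheme", "").lower() == "doi"):
            self.dois.append(attributes.get("content", ""))


def extract_dois(html):
    """Return every DOI found in Dublin Core meta elements of an HTML string."""
    parser = DublinCoreDoiParser()
    parser.feed(html)
    return parser.dois


# Stub page containing only the meta element quoted in the post.
page = ('<html><head><meta name="dc.Identifier" scheme="doi" '
        'content="10.2307/2585367"></head><body></body></html>')
print(extract_dois(page))
```

This only recovers the identifier; whether the DOI still resolves is a separate question.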
Now, what happens when the JSTOR moving wall overlaps with the publisher's material? What happens if a publisher digitises back issues, then assigns them DOIs? Do the JSTOR DOIs then die (as some of them seem to have already done)? And what happens to the poor sap like me, who has been linking to JSTOR DOIs in the naive belief that DOIs don't die?
Suddenly separating identity from resolution is starting to look very attractive...
Setting all these reservations and biases aside, the total number of living organisms that have received Latin binomial names is currently around 1.5 million or so. Amazingly, there is as yet no centralized computer index of these recorded species. It says a lot about intellectual fashions, and about our values, that we have a computerized catalog entry, along with many details, for each of several million books in the Library of Congress but no such catalog for the living species we share our world with. Such a catalog, with appropriately coded information about the habitat, geographical distribution, and characteristic abundance of the species in question (no matter how rough or impressionistic), would cost orders of magnitude less money than sequencing the human genome; I do not believe such a project is orders of magnitude less important. Without such a factual catalog, it is hard to unravel the patterns and processes that determine the biotic diversity of our planet.
<rant> BioOne sucks. Really, really, sucks. I have lost count of the number of times they break DOIs. These are supposed to be the gold standard globally unique identifier, and BioOne continually buggers them. For example, take this URL:
Note the doi=10.1600/02-14.1 bit at the end. If we go to the web page, we see this DOI displayed at the bottom of the page. Yet, when we resolve the DOI, we get the dreaded DOI Not Found error.
This.should.not.happen.
What is BioOne doing!? Now, it is possible that there's a problem with CrossRef, because Googling this paper I found it also lives on Ingenta with, wait for it, another DOI (doi:10.1600/036364404772973960).
This.should.not.happen.
BioOne and Ingenta are hosting the same paper, with different DOIs, only one of which is working. Will somebody please bang some heads together and sort this out! </rant>
Partly inspired by Pedro Beltrao's post Open Science project on domain family expansion about using Google Code as a project management system, I've started to populate the iPhylo project. At this stage I'm uploading some scripts for parsing and extracting bibliographic records, and adding wiki pages describing how this is done, discussing different bibliographic identifiers, etc. The aim is to slowly document the background to all the harvesting and linking that I'm working on. Hence, the Google Code project will have documentation and data, not just code. The code for the web site won't go in for a while yet, as it needs massive cleaning and tidying up.
MarkMail is a great tool for searching mail archives. Although focussing on software development projects, they are open to requests, so last week I asked if they could index TAXACOM. My pitch was that TAXACOM is a long running list full of interesting conversations, has been the subject of scholarly study (Christine Hine's book I mentioned earlier), and is topical given interest in biodiversity and the Encyclopedia of Life.
A broader theme might be databases, or the taxonomic impediment. Below is the chart of messages over time for the query "databases". You can "swipe" the mouse across the chart to select messages from a given time span.
I worry that I may end up spending more time playing with this than I should, but it's a neat tool. Hats off to Jason Hunter of MarkMail for adding TAXACOM so promptly.
The JIT is an advanced JavaScript infovis toolkit based on 5 papers about different information visualization techniques. The JIT implements advanced features of information visualization like Treemaps (with the slice and dice and squarified methods), an adapted visualization of trees based on the Spacetree, a focus+context technique to plot Hyperbolic Trees, and a radial layout of trees with advanced animations (RGraph).
Nicolas also links to a talk by Tamara Munzner, which I've embedded below to remind myself to watch it.
Trivial as this may seem, I'm trying to find out who designed this "Open Access" logo, and whether there are some original files for it. I've seen this logo (or variations on it) on the PLoS web site, on the site of the open access publisher Hindawi Publishing, and in the Mac OS X program Papers.
It's driving me nuts that I can't find the original. Other widely used logos typically have a site where a designer or organisation provides a bunch of versions in different formats, such as the Creative Commons symbols, the ubiquitous RSS feed icon, and other projects such as the Geotag icon. It's sometimes desirable to have different formats of an icon, and ideally a vector-based version (e.g., in EPS or SVG format) that can be used to create images at different resolutions, and these projects provide these files.
Apart from the interesting fact that there doesn't seem to be a standard logo or symbol for Open Access, does anybody know where this logo came from?
The more I play with GBIF the more I come across some spectacular errors. Here's one small example of what can go wrong, and how easy it would be to fix at least some of the errors in GBIF. This is topical given that the recent review of EOL highlighted the importance of vetting and cleaning data.
Oops, the frog is found in the middle of the South Atlantic(!), and in Brazil(!?). These specimen records are provided by the MCZ, Harvard. Looking at the latitude and longitude co-ordinates, it's clear that there has been a comedy of errors. In the case of MCZ A-119852 the longitude is west instead of east; for MCZ A-119850 and MCZ A-119851 the latitude and longitude have been swapped, and the longitude is west instead of east (again). If we make these changes, the specimens go back to Madagascar (the rectangle on the SVG map below). If you don't see the map, use a decent web browser such as Safari 3 or Firefox 2. If you must use Internet Explorer, grab the RENESIS player.
Interestingly the DiGIR records all list the country as Madagascar, so for any specimen in GBIF it would be trivial to ask two questions:
Do the co-ordinates for the specimen fall inside the bounding box for the country?
If not, will they if we change sign (i.e., hemispheres) and/or swap latitude and longitude?
These would be trivial things to do, and would probably catch a lot of the more egregious errors in GBIF.
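A rough sketch of those two tests in Python follows. The bounding box for Madagascar is an approximation chosen for illustration, the coordinates in the usage lines are hypothetical (not the actual MCZ values), and the function names are invented here.

```python
from itertools import product

# Approximate bounding box for Madagascar (an illustrative assumption):
# (min_lat, max_lat, min_lon, max_lon) in decimal degrees.
MADAGASCAR = (-26.0, -11.5, 43.0, 51.0)


def in_box(lat, lon, box):
    """Test 1: does the point fall inside the country's bounding box?"""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def suggest_fixes(lat, lon, box):
    """Test 2: if the point is outside the box, try swapping latitude and
    longitude and/or flipping signs, and report every variant that lands
    inside. Returns a list of (description, lat, lon) tuples."""
    if in_box(lat, lon, box):
        return []
    fixes = []
    for swap, lat_sign, lon_sign in product((False, True), (1, -1), (1, -1)):
        if not swap and lat_sign == 1 and lon_sign == 1:
            continue  # this is the record as supplied
        new_lat, new_lon = (lon, lat) if swap else (lat, lon)
        new_lat, new_lon = new_lat * lat_sign, new_lon * lon_sign
        if in_box(new_lat, new_lon, box):
            steps = []
            if swap:
                steps.append("swap latitude and longitude")
            if lat_sign == -1:
                steps.append("flip latitude sign")
            if lon_sign == -1:
                steps.append("flip longitude sign")
            fixes.append((" + ".join(steps), new_lat, new_lon))
    return fixes


# Hypothetical records: one with longitude west instead of east,
# one with the values swapped as well.
print(suggest_fixes(-18.0, -47.0, MADAGASCAR))
print(suggest_fixes(-47.0, -18.0, MADAGASCAR))
```

A bounding box is a blunt instrument (it includes a lot of sea), so a suggested fix is a hint for a human to check, not a correction to apply blindly.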
Fixing errors
What will be interesting is whether these records will be fixed. I have sent feedback via GBIF's web site, as well as sending an email to the MCZ. I'll let readers know what happens.
Ground truth
Lastly, those interested in the frog itself may find the iSpecies search frustrating as the link returned by Google Scholar leads to a page in Ingenta saying:
This title is now published by Blackwell Publishing and can be found here www.ingentaconnect.com/content/bsc/zoj
Nope, the paper in question is actually at ScienceDirect (doi:10.1006/zjls.1995.0040). This paper describes the species, and gives the latitude and longitude of the collection localities (correctly).
I started this blog with the goal of documenting my own efforts to make a database of evolutionary trees, based on ideas sketched in hdl:10.1038/npre.2007.1028.1. I've felt that the major task is to link phylogenies to other information, such as taxon names, specimens, localities, images, publications, etc. That is, to embed trees in a broader context. Discovering how to engage with that broader context led to a bunch of experiments, toys, and diversions:
iSpecies, a toy to aggregate information on a species.
Semant, experiments with RDF and triple stores (AKA the Semantic Web).
bioGUID, an attempt to make identifiers resolvable, with an increasing focus on developing an OpenURL resolver for biodiversity literature.
iSpecies and bioGUID are still operational, but the ant work fell victim to server crashes, and a growing frustration with the limitations of triple stores. Blogs for all three projects document their histories: iSpecies, Semant, and bioGUID. In a sense, these blogs document the steps along the way to iPhylo.
Based on this experience, I started again with what I've previously referred to as a database of everything. The first public demo is online at iphylo.org/~rpage/demo1. It's very crude, but may give a sense of what I'm trying to do.
The goal of iPhylo is to treat biodiversity objects as equal citizens. Each object has a unique identifier, associated metadata, and is linked to other objects (for example, a specimen is linked to sequences, sequences are linked to publications, etc.). By following the links it is possible to generate new views on existing information, such as a map for a study that doesn't include one. For example, below is a map generated for Brady et al. (doi:10.1073/pnas.0605858103), based on links between sequences and specimens (if you can't see the map you need an SVG-capable web browser, such as Safari 3 or Firefox 2).
Ironically there are no phylogenies yet. At this stage I'm trying to link the bits together.
How does it work?
More on this later. Briefly, iPhylo uses an entity-attribute-value model database to store objects and their relationships. Like bioGUID, iPhylo relies on a suite of web services (most external, some I've developed locally) to locate and resolve identifiers. iPhylo resolves identifiers for PubMed records, GenBank sequences, museum specimens, publications, etc. and adds the associated metadata to a local database. Wherever possible it resolves any links in the metadata (e.g., if a GenBank record mentions a specimen, iPhylo will try to retrieve information on that specimen). When you view an object in iPhylo, these links are displayed. iPhylo will also try to convert bibliographic records to identifiers (such as DOIs) if no identifiers are provided, and also extracts georeferences for specimens and sequences, either from original records or by using a georeferencing service. Taxonomic names are resolved using uBio, and are treated as "tags."
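To give a flavour of what an entity-attribute-value store looks like, here is a minimal sketch using Python's built-in sqlite3 module. This is not the iPhylo schema: the table name, attribute names, and identifiers are all invented for illustration, and links between objects are modelled simply as attributes whose value is another object's identifier.

```python
import sqlite3

# One narrow table holds every fact about every kind of object.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE eav ("
    "entity TEXT NOT NULL, attribute TEXT NOT NULL, value TEXT NOT NULL)"
)


def add(entity, attribute, value):
    """Store one fact about an object."""
    conn.execute("INSERT INTO eav VALUES (?, ?, ?)", (entity, attribute, value))


def describe(entity):
    """All attribute/value pairs stored for one object."""
    return conn.execute(
        "SELECT attribute, value FROM eav WHERE entity = ? "
        "ORDER BY attribute, value",
        (entity,),
    ).fetchall()


def pointing_at(entity):
    """Incoming links: objects with an attribute whose value is this identifier."""
    return conn.execute(
        "SELECT entity, attribute FROM eav WHERE value = ? "
        "ORDER BY entity, attribute",
        (entity,),
    ).fetchall()


# Hypothetical objects: a sequence that cites a voucher specimen and a paper.
add("sequence:EX000001", "organism", "Examplea fictus")
add("sequence:EX000001", "voucher", "specimen:EX-1")
add("sequence:EX000001", "publishedIn", "publication:example-2008")
add("specimen:EX-1", "latitude", "-18.0")
add("specimen:EX-1", "longitude", "47.0")

print(describe("specimen:EX-1"))
print(pointing_at("specimen:EX-1"))
```

The appeal of this layout is that a new kind of object or attribute needs no schema change; the cost is that every value is an untyped string and queries spanning several attributes need self-joins.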
At present iPhylo is being populated by various scripts; there is no facility for users to add data. This is something I will add in the future.
Getting the data
One of the biggest challenges is getting data (or, to be more precise, figuring out how to harvest available data). iPhylo builds on code for bioGUID. I've also been exploring bulk harvesting data sources. Sometimes this is easy. Many sequences in GenBank are linked to records in PubMed, so if you know the PubMed ID for a paper, you can harvest the sequences. For example, even though the Bulletin of the American Museum of Natural History isn't indexed in PubMed, it is possible to retrieve all the sequences GenBank records as being from papers published in the Bulletin. You can retrieve this list in XML form by clicking here.
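As a sketch of the PubMed-to-GenBank step, NCBI's E-utilities offer an ELink service that returns the records in one database linked to a record in another. The Python below only builds the request URL and parses a response; it makes no network call. It is not the script iPhylo uses. The database name "nuccore" reflects the current NCBI naming, and the PubMed ID and sequence IDs in the sample response are made up.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def elink_url(pubmed_id):
    """URL asking ELink for nucleotide records linked to one PubMed record."""
    query = urlencode({"dbfrom": "pubmed", "db": "nuccore", "id": str(pubmed_id)})
    return ELINK + "?" + query


def linked_sequence_ids(elink_xml):
    """Pull the linked record IDs out of an ELink XML response.

    Collects IDs from every LinkSetDb block in the response; a real
    response may contain more than one kind of link.
    """
    root = ET.fromstring(elink_xml)
    return [link_id.text for link_id in root.findall("LinkSet/LinkSetDb/Link/Id")]


# Hand-written sample in the shape of an ELink response (IDs are hypothetical).
SAMPLE = """<eLinkResult>
  <LinkSet>
    <DbFrom>pubmed</DbFrom>
    <IdList><Id>12345</Id></IdList>
    <LinkSetDb>
      <DbTo>nuccore</DbTo>
      <LinkName>pubmed_nuccore</LinkName>
      <Link><Id>111</Id></Link>
      <Link><Id>222</Id></Link>
    </LinkSetDb>
  </LinkSet>
</eLinkResult>"""

print(elink_url(12345))
print(linked_sequence_ids(SAMPLE))
```

In practice you would fetch the URL (for example with urllib.request.urlopen), pass the returned bytes to linked_sequence_ids, and then retrieve the sequence records themselves with a further E-utilities call.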
Why?
There are all sorts of things which could be done with this. For example, by linking objects together we can also track the provenance of data, and ultimately build "citation networks" of specimens, sequences, etc. For background see my paper on "Biodiversity informatics: the challenge of linking data and the role of shared identifiers" (doi:10.1093/bib/bbn022, preprint at hdl:10101/npre.2008.1760.1).
As mentioned above, we can generate maps even if the original study didn't include one (by following the links). Given that we can geotag studies, this opens up the possibility of querying studies spatially. For example, this study on bats and this study on rodents deal with taxa with very similar distributions. A spatial query could find these easily. Imagine being interested in, say, Madagascar, and being able to find relevant phylogenetic studies, even if the title and abstract of the paper don't mention Madagascar by name.
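The simplest version of such a spatial query can be sketched in a few lines of Python: summarise each study by the bounding box of its georeferenced specimens, then ask which boxes overlap the region of interest. The study names, coordinates, and region below are all hypothetical, and the sketch ignores boxes that straddle the 180° meridian.

```python
def bounding_box(points):
    """Smallest (min_lat, max_lat, min_lon, max_lon) box around (lat, lon) points."""
    lats = [lat for lat, lon in points]
    lons = [lon for lat, lon in points]
    return (min(lats), max(lats), min(lons), max(lons))


def overlaps(a, b):
    """True if two (min_lat, max_lat, min_lon, max_lon) boxes intersect."""
    return a[0] <= b[1] and b[0] <= a[1] and a[2] <= b[3] and b[2] <= a[3]


def studies_in_region(studies, region):
    """Names of studies whose specimen bounding box overlaps the region."""
    return sorted(
        name for name, points in studies.items()
        if overlaps(bounding_box(points), region)
    )


# Hypothetical georeferenced specimens for three studies.
studies = {
    "bat study": [(-18.9, 47.5), (-23.4, 43.7)],
    "rodent study": [(-14.3, 49.7), (-21.2, 47.1)],
    "frog study": [(10.6, -61.2), (5.1, -59.0)],
}
region_of_interest = (-26.0, -11.5, 43.0, 51.0)  # roughly Madagascar

print(studies_in_region(studies, region_of_interest))
```

Note that neither the query nor the data mentions the country by name; the match comes entirely from the coordinates attached to the specimens.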
There is also potential to clean data. One of the first studies I uploaded is Grant et al.'s study of dart-poison frogs. The map for this study shows an outlying point in California:
This point is MVZ 199828, which is a specimen of the salamander Aneides flavipunctatus. In GenBank, MVZ 199828 is listed as the voucher for seven sequences from the frog Mannophryne trinitatis. Oops. A quick iSpecies search, and a click on the GBIF map, reveals that there is a specimen MVZ 199838 of Mannophryne trinitatis. I suspect that this is the true voucher for these sequences, and the GenBank records contain a typo.
Future
This is all still very, very crude. The demo is slow, the queries aren't particularly clever, and I've probably gone overboard on using JavaScript to populate the web pages. The real value isn't in the web pages, but the links between the data objects. This is my main focus -- extracting and adding links. For now the data is displayed but you can't edit it. However, this is coming. Very basic RDF and RSS feeds are available for each object, fans of microformats will find some goodies, and sociologists of science may find some of the coauthorship graphs intriguing.
The dissection of the colossal squid (Mesonychoteuthis hamiltoni) specimen from Antarctica has been getting a lot of coverage. Pangs of homesickness, especially seeing Steve O'Shea enthusing about the beast. Steve was a contemporary at Auckland Uni when I was a student. I remember him being deeply disappointed in me because I moved away from doing alpha taxonomy of crustaceans (describing taxa such as Pinnotheres atrinicola, right) to fussing with cladistics and computers. Looking at Steve on YouTube, I think it's clear who's having more fun. Maybe I should have stuck to taxonomy after all...
Last month I was at the MBL in Woods Hole, taking part in the review of the Biodiversity Informatics Group. BIG is responsible for the EOL web site. I chair the Informatics Advisory Group, which provides advice to BIG, and it was our task to produce an evaluation of where things stood. I've written a post on the Encyclopaedia of Life blog about some of the big challenges facing EOL as it moves into its second year.