Wednesday, May 25, 2011

The top-ten new species described in 2010 and the failure of taxonomy to embrace Open Access publication

Each year the grandly titled International Institute for Species Exploration (IISE) publishes a list of the top 10 species described in the previous year. This year's list is reproduced below, to which I've added links to the original publications (why do people still think it's OK to omit links to the primary literature when all of these articles are online?).

The striking thing is that only 2 of the 10 species were described in Open Access publications (and I use that term loosely, as the Arthropod Systematics & Phylogeny PDFs are freely available but the licensing isn't clear). Sadly, much of our knowledge of the planet's diversity is still locked up behind a paywall.

  1. Caerostris (Darwin's Bark Spider). Kuntner, M. and I. Agnarsson. 2010. Web gigantism in Darwin's bark spider, a new species from Madagascar (Araneidae: Caerostris). The Journal of Arachnology 38(2):346-356. doi:10.1636/B09-113.1. Open Access: No.
  2. Mycena (Bioluminescent Mushroom). Desjardin, D.E., B.A. Perry, D.J. Lodge, C.V. Stevani, and E. Nagasawa. 2010. Luminescent Mycena: new and noteworthy species. Mycologia 102(2):459-477. doi:10.3852/09-197. Open Access: No.
  3. Halomonas (Bacterium). Sanchez-Porro, C., B. Kaur, H. Mann and A. Ventosa. 2010. Halomonas titanicae sp. nov., a halophilic bacterium isolated from the RMS Titanic. International Journal of Systematic and Evolutionary Microbiology 60(12):2768-2774. doi:10.1099/ijs.0.020628-0. Open Access: No.
  4. Varanus (Monitor Lizard). Welton, L.J., C.D. Siler, D. Bennett, A. Diesmos, M.R. Duya, R. Dugay, E.L.B. Rico, M. van Weerd and R.M. Brown. 2010. A spectacular new Philippine monitor lizard reveals a hidden biogeographic boundary and a novel flagship species for conservation. Biology Letters 6(5):654-658. doi:10.1098/rsbl.2010.0119. Open Access: No.
  5. Glomeremus (Pollinating cricket). Hugel, S., C. Micheneau, J. Fournel, B.H. Warren, A. Gauvin-Bialecki, T. Pailler, M.W. Chase and D. Strasberg. 2010. Glomeremus species from the Mascarene islands (Orthoptera, Gryllacrididae) with the description of the pollinator of an endemic orchid from the island of Réunion. Zootaxa 2545:58-68. PDF. Open Access: No.
  6. Philantomba (Duiker). Colyn, M., J. Hulselmans, G. Sonet, P. Oudé, J. de Winter, A. Natta, Z.T. Nagy and E. Verheyen. 2010. Discovery of a new duiker species (Bovidae: Cephalophinae) from the Dahomey Gap, West Africa. Zootaxa 2637:1-30. PDF. Open Access: No.
  7. Tyrannobdella (Leech). Phillips, A.J., R. Arauco-Brown, A. Oceguera-Figueroa, G.P. Gomez, M. Beltran, Y.-T. Lai and M.E. Siddall. 2010. Tyrannobdella rex n. gen. n. sp. and the evolutionary origins of mucosal leech infestations. PLoS ONE 5(4):e10057. doi:10.1371/journal.pone.0010057. Open Access: Yes.
  8. Psathyrella (Underwater mushroom). Frank, J.L., R.A. Coffan and D. Southworth. 2010. Aquatic gilled mushrooms: Psathyrella fruiting in the Rogue River in southern Oregon. Mycologia 102(1):93-107. doi:10.3852/07-190. Open Access: No.
  9. Saltoblattella (Jumping cockroach). Bohn, H., M. Picker, K.-D. Klass and J. Colville. 2010. A jumping cockroach from South Africa, Saltoblattella montistabularis, gen. nov., spec. nov. (Blattodea: Blattellidae). Arthropod Systematics and Phylogeny 68(1):53-69. PDF. Open Access: Yes.
  10. Halieutichthys (Pancake Batfish). Ho, H.-C., P. Chakrabarty and J.S. Sparks. 2010. Review of the Halieutichthys aculeatus species complex (Lophiiformes: Ogcocephalidae), with descriptions of two new species. Journal of Fish Biology 77(4):841-869. doi:10.1111/j.1095-8649.2010.02716.x. Open Access: No.

Monday, May 23, 2011

BioStor article published (finally)

My article describing BioStor — "Extracting scientific articles from a large digital archive: BioStor and the Biodiversity Heritage Library" — has finally seen the light of day in BMC Bioinformatics (doi:10.1186/1471-2105-12-187; the DOI is not working at the moment, so give it a little while to go live; in the meantime you can access the article here).

Getting this article published was more work than I expected. There seems to be an inverse correlation between how important I think the work is and how easy it is to get published — the more straightforward I think the article is, the more work it is to convince the referees of its merits. Of course, it may be that my judgement of the article's merits influences how much effort I put into making the manuscript as rigorous and clear as possible. And perhaps having a blog has spoiled me: I really struggle with the notion that it takes months to publish a paper, especially as most of the intellectual debate involved (i.e., the refereeing process) is behind closed doors, compared to the open and immediate nature of commentary on a blog post.

However, despite my frustrations with the refereeing process, there's no doubt that it did improve the manuscript (you can see the original version at Nature Precedings, hdl:10101/npre.2010.4928.1).

With the publication of this article, and last week's conversation with Anurag Acharya and Darcy Dapra about getting BioStor indexed by Google Scholar, it has been a good few days for BioStor.



Friday, April 15, 2011

BHL, DjVu, and reading the f*cking manual

One of the biggest challenges I've faced with the BioStor project, apart from dealing with messy metadata, has been handling page images. At present I get these from the Biodiversity Heritage Library. They are big (typically 1 MB in size), and have the caramel colour of old paper. Nothing fills up a server quicker than thousands of images.

A while ago I started playing with ImageMagick to resize the images, making them smaller, as well as looking at ways to remove the background colour, leaving just black text and lines on a white background.

Before and after converting BHL image


I think this makes the page image clearer, as well as removing the impression that this is some ancient document, rather than a scientific article. Yes, it's the Biodiversity Heritage Library, but the whole point of the taxonomic literature is that it lasts forever. Why not make it look as fresh as when it was first printed?

Working out how to best remove the background colour takes some effort, and running ImageMagick on every image that's downloaded starts putting a lot of stress on the poor little Mac Mini that powers BioStor.

Then there's the issue of having an iPad viewer for BHL, and making it interactive. So, I started looking at the DjVu files generated by the Internet Archive, and thinking whether it would make more sense to download those and extract images from them, rather than go via the BHL API. I'll need the DjVu files for the text layout anyway (see Towards an interactive DjVu file viewer for the BHL).

I couldn't remember the command to extract images from DjVu, but I did remember that Google is my friend, which led me to this question on Stack Overflow: Using the DjVu tools to for background / foreground seperation?.

OMG! DjVu tools can remove the background? A quick look at the documentation confirmed it. So I did a quick test. The page on the left is the default page image, the page on the right was extracted using ddjvu with the option -mode=foreground.
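For the record, this is the kind of invocation involved. The `-mode=foreground` option is the one discussed above; the filenames and page number here are just for illustration (a sketch, not my actual script):

```python
import shlex

def ddjvu_foreground_cmd(djvu_file, page, out_file):
    """Build the ddjvu call that renders just the foreground layer
    (black text and line art) of one page, dropping the scanned
    background colour."""
    return [
        "ddjvu",
        "-format=tiff",      # output format; ddjvu also supports pnm etc.
        "-mode=foreground",  # render only the foreground layer
        f"-page={page}",
        djvu_file,
        out_file,
    ]

cmd = ddjvu_foreground_cmd("article.djvu", 507, "507.tif")
print(shlex.join(cmd))
```

Wrapping the command in a function like this makes it easy to batch over every page of a scanned volume.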



Much, much nicer. But why didn't I know this? Why did I waste time playing with ImageMagick when it's a trivial option in a DjVu tool? And why does BHL serve the discoloured page images when it could serve crisp, clean versions?

So, I felt like an idiot. But the other good thing that's come out of this is that I've taken a closer look at the Internet Archive's BHL-related content, and I'm beginning to think that perhaps the more efficient way to build something like BioStor is not through downloading BHL data and using their API, but by going directly to the Internet Archive and downloading the DjVu and associated files. Maybe it's time to rethink everything about how BioStor is built...

Tuesday, April 12, 2011

Dark taxa: GenBank in a post-taxonomic world

How to cite: Page, R. (2011). Dark taxa: GenBank in a post-taxonomic world. https://doi.org/10.59350/xhvv2-xjt24
In an earlier post (Are names really the key to the big new biology?) I questioned Patterson et al.'s assertion in a recent TREE article (doi:10.1016/j.tree.2010.09.004) that names are key to the new biology.

In this post I'm going to revisit this idea by doing a quick analysis of how many species in GenBank have "proper" scientific names, and whether the number of named species has changed over time. My definition of "proper" name is a little loose: anything that had two words, the second one starting with a lower case letter, was treated as a proper name. Hence, a name like "Eptesicus sp. A JLE-2010" is not a proper name, but Eptesicus andersoni is.
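This loose definition amounts to a one-line regular expression test (a sketch of the idea; the pattern is the same one used in the actual analysis, described at the end of the post):

```python
import re

# Loose test for a "proper" binomial: two words, the second
# starting with a lower-case letter.
PROPER_NAME = re.compile(r"^[A-Z][a-z]+ [a-z][a-z]+$")

def is_proper_name(name):
    return bool(PROPER_NAME.match(name))

print(is_proper_name("Eptesicus andersoni"))       # True
print(is_proper_name("Eptesicus sp. A JLE-2010"))  # False
```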

Mammals

Since GenBank started, every year has seen some 100-200 mammal species added to the database.


Until around 2003 almost all of these species had proper binomial names, but since then an increasing percentage of species-level taxa haven't been identified to species. In 2010 three-quarters of new tax_ids for mammals weren't identified to species.

Invertebrates

For "invertebrates" 2010 saw an explosive growth in the number of new taxa sequenced, with nearly 71,000 new taxa added to GenBank.



This coincides with a spectacular drop in the number of properly-named taxa, but even before 2010 the proportion of named invertebrate species in GenBank was in decline: in 2009 just over half of the species added had binomials.

Bacteria

To put this in perspective, here are the equivalent graphs for bacteria.
Although at the outset most of the bacteria in GenBank had binomial names, pretty quickly the bulk of sequenced bacteria had informal names. In 2010 less than 1% of newly sequenced bacteria had been formally described.

Dark taxa

For bacteria the graphs are hardly surprising. To get a proper name a bacterium must be cultured, and the vast majority of bacteria haven't been (or can't be) cultured. Hence, microbiologists can gloat at the nomenclatural mess plant and animal taxonomists have to deal with only because they themselves have a tiny number of names to deal with.

For mammals and invertebrates there's a clear decline in the use of proper names. It would be tempting to suggest that this reflects a decline in the number of taxonomists - there might simply not be enough of them in enough groups to be able to identify and/or describe the taxa being sequenced.

However, if we look at the recent peaks of unnamed animal species, we discover that many have names like Lepidoptera sp. BOLD:AAD7075, indicating that they are DNA barcodes from the Barcode of Life Data Systems. Of the 62,365 unnamed invertebrates added last year, 54,546 are BOLD sequences that haven't been assigned to a known species. Of the 277 unnamed mammals, 218 are BOLD taxa. Hence, DNA barcoding is flooding GenBank with taxa that lack proper names (and typically are represented by a single DNA barcode sequence).

There are various ways to interpret these graphs, but for me the message is clear. The bulk of newly added taxa in GenBank are what we might term "dark taxa", that is, taxa that aren't identified to a known species. This doesn't necessarily mean that they are species new to science; we may already have encountered these species before, they may be sitting in museum collections, and have descriptions already published. We simply don't know. As the output from DNA barcoding grows, the number of dark taxa will only increase, and macroscopic biology will start to look a lot like microbiology.


A post-taxonomic world
If we look at the graphs for bacteria, we see that taxonomic names are virtually irrelevant, and yet microbiology seems to be doing fine as a discipline. So, perhaps it's time to think about a post-taxonomic world where taxonomic names, contra Patterson et al., are not that important. We can discover a good deal about organismal biology from GenBank alone (see my post Visualising the symbiome: hosts, parasites, and the Tree of Life for some examples, as well as Rougerie et al. 2010 doi:10.1111/j.1365-294X.2010.04918.x).

This leaves us with two questions:
  1. How much biology can we do without taxonomic names?
  2. If the lack of taxonomic names limits what we can do (and, playing devil's advocate, this is an open question) how can we speed up linking GenBank sequences to names?


I suspect that the answer to (1) is "quite a lot" (especially if we think like microbiologists). Question (2) is ultimately a question about how fast we can link literature, museum collections, sequences, and phylogenies. If progress to date is any indication, we need to rethink how we do this, and in a hurry, because dark taxa are accumulating at an accelerating rate.

How the analyses were done

Although the NCBI makes a dump of its taxonomic database available via FTP (at ftp://ftp.ncbi.nih.gov/pub/taxonomy/), this dump doesn't have dates for when the taxa were added to the database. However, using the Entrez EUtilities we can get the tax_ids that were published within a given date range. For example, to retrieve all the tax_ids added to the database in December 2010, we set the URL parameters &mindate=2010/12/01 and &maxdate=2010/12/31 to form this URL:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=taxonomy&mindate=2010/12/01&maxdate=2010/12/31&retmax=1000000.

I've set &retmax to a big number to ensure I get all the tax_ids for that month (in this case 23511). I then made a local copy of the NCBI database in MySQL (instructions here) and queried for all species-level taxa in GenBank. I used a rather crude regular expression REGEXP '^[A-Z][a-z]+ [a-z][a-z]+$' to find just those species names that were likely to be proper scientific names (i.e., no "sp.", "aff.", museum or voucher codes, etc.). To group the species into major taxonomic groups I used the division_id.
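Constructing the esearch URL for an arbitrary month is trivial to script. A minimal sketch (the base URL and parameter names are exactly those used above; the helper function is mine):

```python
from urllib.parse import urlencode

BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(mindate, maxdate, retmax=1000000):
    """Build an esearch URL for all taxonomy ids added in a date range."""
    params = {
        "db": "taxonomy",
        "mindate": mindate,
        "maxdate": maxdate,
        "retmax": retmax,  # big enough to get every id in one request
    }
    return BASE + "?" + urlencode(params)

url = esearch_url("2010/12/01", "2010/12/31")
print(url)
```

Looping this over every month since GenBank started gives the per-year counts plotted above.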

Results are available in a Google Spreadsheet.

Friday, April 01, 2011

Data matters but do data sets?

Interest in archiving data and data publication is growing, as evidenced by projects such as Dryad, and earlier tools such as TreeBASE. But I can't help wondering whether this is a little misguided. I think the issues are granularity and reuse.

Taking the second issue first, how much re-use do data sets get? I suspect the answer is "not much". I think there are two clear use cases, repeatability of a study, and benchmarks. Repeatability is a worthy goal, but difficult to achieve given the complexity of many analyses and the constant problem of "bit rot" as software becomes harder to run the older it gets. Furthermore, despite the growing availability of cheap cloud computing, it simply may not be feasible to repeat some analyses.

Methodological fields often rely on benchmarks to evaluate new methods, and this is an obvious case where a dataset may get reused ("I ran my new method on your dataset, and my method is the business — yours, not so much").

But I suspect the real issue here is granularity. Take DNA sequences, for example. New studies rarely reuse (or cite) previous data sets, such as a TreeBASE alignment or a GenBank Popset. Instead they cite individual sequences by accession number. I think in part this is because the rate of accumulation of new sequences is so great that any subsequent study would need to add these new sequences to be taken seriously. Similarly, in taxonomic work the citable data unit is often a single museum specimen, rather than a data set made up of specimens.

To me, citing data sets makes almost as much sense as citing journal volumes - the level of granularity is wrong. Journal volumes are largely arbitrary collections of articles, it's the articles that are the typical unit of citation. Likewise I think sequences will be cited more often than alignments.

It might be argued that there are disciplines where the dataset is the sensible unit, such as an ecological study of a particular species. Such a data set may lack obvious subsets, and hence it makes sense to cite it as a unit. But my expectation here is that such datasets will see limited re-use, for the very reason that they can't be easily partitioned and mashed up. Data sets such as alignments are built from smaller, reusable units of data (i.e., sequences) that can be recombined, trimmed, or merged, and hence can be readily re-used. Monolithic datasets with largely unique content can't be easily mashed up with other data.

Hence, my suspicion is that many data sets in digital archives will gather digital dust, and anyone submitting a data set in the expectation that it will be cited may turn out to be disappointed.

Mendeley and Web Hooks

Quick, poorly thought out idea. I've argued before that Mendeley seems the obvious tool to build a "bibliography of life." It has pretty much all the features we need: nice editing tools, support for DOIs, PubMed identifiers, social networking, etc.

But there's one thing it lacks. There's not an easy way to transmit updates from Mendeley to another database. There are RSS feeds for groups, such as this one for the "Museum Type Catalogues" group, but that just lists recently added articles. What if I edit an article, say by correcting the authorship, or adding a DOI? How can I get those edits into databases downstream?

One way would be if Mendeley provided RSS feeds for each article, and these feeds would list the edits made to that article. But polling thousands of individual RSS feeds would be a hassle. Perhaps we could have a user-level RSS feed of edits made?

But another way to do this would be with web hooks, which I explored earlier in connection with updating literature within a taxonomic database. The idea is as follows:
  1. I have a taxonomic database that contains literature. It also has a web hook where I can tell the database that a record has been edited elsewhere.
  2. I edit my Mendeley library using the desktop client.
  3. When I've finished all the edits I've made (e.g., DOIs added, etc.), the web hook is automatically called and the taxonomic database notified of the edits.
  4. The taxonomic database processes the edits, and if it accepts them it updates its own records.

Several things are needed to make this work. We need to be able to talk about the same record in the taxonomic database and in Mendeley, which means either the database stores the Mendeley identifier, or vice versa, or both. We also need a way to find all the recent edits made in Mendeley. Given that the Mendeley database is stored locally as a SQLite database, one simple hack would be to write a script that was called at a set time, determined which records had been changed (records in the Mendeley SQLite database are timestamped) and sent those to the web hook. If we're clever, we may even be able to automate this by calling the script when Mendeley quits (depending on how scriptable the operating system and application are).
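The timestamp-polling hack could be sketched like this. Note that the table and column names ("Documents", "modified") are hypothetical stand-ins — the real Mendeley SQLite schema would need inspecting — and the in-memory database here just simulates it:

```python
import json
import sqlite3

def changed_records(conn, since):
    """Find records modified since the last run and package them
    as a payload for the web hook."""
    rows = conn.execute(
        "SELECT id, title, doi FROM Documents WHERE modified > ?", (since,)
    ).fetchall()
    return [{"id": r[0], "title": r[1], "doi": r[2]} for r in rows]

# Simulate the local library database for the sketch:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Documents (id, title, doi, modified)")
conn.execute("INSERT INTO Documents VALUES "
             "(1, 'BioStor paper', '10.1186/1471-2105-12-187', 200)")
conn.execute("INSERT INTO Documents VALUES (2, 'Old record', NULL, 50)")

# This JSON is what would be POSTed to the web hook URL
payload = json.dumps(changed_records(conn, since=100))
print(payload)
```

The only state the script needs to keep between runs is the timestamp of the last call.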

Of course, what would be even better is if the Mendeley application had this feature built in. You supply one or more web hook URLs that Mendeley will call, say after any edits have been synchronised with your Mendeley database in the cloud. More and more I think we need to focus on how we join all these tools and databases together, and web hooks look like being the obvious candidate.

Thursday, March 31, 2011

Paper on NCBI and Wikipedia published in PLoS Currents: Tree of Life

My paper describing the mapping between NCBI and Wikipedia has been published in PLoS Currents: Tree of Life. You can see the paper here. It's only just gone live, so it's yet to get a PubMed Central number (one of the nice features of PLoS Currents is that the articles get archived in PMC).

Publishing in PLoS Currents: Tree of Life was a pleasant experience. The Google Knol editing environment was easy to use, and the reviewing process quick. It's obviously a new and rather experimental journal, and there are a few things that could be improved. Automatically looking up articles by PubMed identifier is nice, but it would also be great to do this for DOIs as well. Furthermore, the PubMed identifiers aren't displayed as clickable links, which rather defeats the point of having references on the web (I've added DOI links to the articles wherever possible). But, minor grumbles aside, as a way to get an Open Access article published for free, and have it archived in PubMed Central, PLoS Currents is hard to beat. What will be interesting is whether the article receives any comments. This seems to be one area online journals haven't really cracked — providing an environment where people want to engage in discussion.

Monday, March 28, 2011

Linking the NCBI taxonomy to BBC Wildlife Finder




A few weeks ago I spent some time mapping pages from the BBC Wildlife Finder to the equivalent taxa in the NCBI taxonomy. This seemed a useful exercise because the Wildlife Finder pages have some wonderful picture, video, and audio content, as well as other nice features, such as reusing Wikipedia page titles as "slugs" in the BBC page URLs. For example, the Wikipedia page for the Yacare Caiman (Caiman yacare) has the URL http://en.wikipedia.org/wiki/Yacare_Caiman, and the BBC page has the URL http://www.bbc.co.uk/nature/life/Yacare_Caiman. Both share the slug Yacare_Caiman.
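Because the BBC reuses the Wikipedia slugs, the mapping is almost mechanical. A minimal sketch (it obviously only works for taxa where the BBC has a page with the matching slug):

```python
# Derive the BBC Wildlife Finder URL from a Wikipedia URL via the
# shared slug (both URL patterns are taken from the examples above).
def bbc_url_from_wikipedia(wikipedia_url):
    slug = wikipedia_url.rsplit("/", 1)[-1]
    return "http://www.bbc.co.uk/nature/life/" + slug

print(bbc_url_from_wikipedia("http://en.wikipedia.org/wiki/Yacare_Caiman"))
# http://www.bbc.co.uk/nature/life/Yacare_Caiman
```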

After adding these links to iphylo.org/linkout, where you can find them listed on the BBC category page, I've finally uploaded these to the NCBI, so now some 504 NCBI taxon pages have links to high quality multimedia from the BBC.

- Posted using BlogPress from my iPad

Location: Schmiedestraße, Wetter, Germany

Friday, March 25, 2011

Fun things about crustaceans

One side effect of playing with ways to visualise and integrate biology databases is that you stumble across the weird and wonderful stuff that living organisms get up to. My earliest papers were on crustacean taxonomy, so I thought I'd try my latest toy on them.

What lives on crustaceans?

The "symbiome" graph for crustacea shows a range of associations, including marine bacteria (Vibrio), fungi (microsporidians), and other organisms, including other crustacea (crustaceans are at the top of the circle, I'll work on labelling these diagrams a little better).

What do crustaceans live on?

Crustacea (in addition to parasitising other crustacea) parasitise several vertebrate groups, including fish and whales. But they also occur in terrestrial vertebrates. For example, sequence EF583871 is from the pentastomid worm Porocephalus crotali from a dog. When people think of terrestrial crustacea they usually don't think of parasites. There's also a prominent line from crustaceans to what turns out to be corals, representing coral-living barnacles.

It's instructive to compare this with insects, which similarly parasitise vertebrates. The striking difference is the association between insects and flowering plants.


I guess these really need to be made interactive, so we could click on them and discover more about the association represented by each line in the diagram.

Visualising the symbiome: hosts, parasites, and the Tree of Life

Back in 2006 in a short post entitled "Building the encyclopedia of life" I wrote that GenBank is a potentially rich source of information on host-parasite relationships. Often sequences of parasites will include information on the name of the host (the example I used was sequence AF131710 from the platyhelminth Ligophorus mugilinus, which records the host as the Flathead mullet Mugil cephalus).

I've always wanted to explore this idea a bit more, and have finally made a start, in part inspired by the recent VIZBI 2011 meeting. I've grabbed a large chunk of GenBank, mined the sequences for host records, and created some simple visualisations of what I'm terming (with tongue firmly in cheek) the "symbiome". Jonathan Eisen will not be happy, but I need a word that describes the complete set of hosts, mutualists, symbionts with which an organism is associated, and "symbiome" seems appropriate.

Human symbiome
To illustrate the idea, below is the human "symbiome". This diagram shows all the taxa in GenBank arranged in a circle, with lines connecting those organisms that have DNA sequences where humans are recorded as their host.


At a glance, we have a lot of bacteria (the gray bar with E. coli) and fungi (blue bar with Yeast), and a few nematodes and arthropods.

Fig tree symbiome
Next up are organisms collected from fig trees (genus Ficus).

Fig trees have wasp pollinators (the dark line landing near the honey bee Apis), as well as nematodes (dark line landing near Caenorhabditis elegans). There are also some associations with fungi and other arthropods.

Which taxa host insects?
Next up is a plot of all associations involving insects and a host.

The diagram is dominated by insect-flowering plant interactions, followed by insect-vertebrate associations (most likely bird and mammal lice).

Which taxa are hosted by insects?
We can reverse the question and ask what organisms are hosted by insects:

Lots of associations between insects and fungi, as well as bacteria, and a few other organisms, such as nematodes, and Plasmodium (the organism which causes malaria).

Frog symbiome
Lastly, below is the symbiome of frogs. "Worms" feature prominently, as well as the fungus that causes chytridiomycosis.

How the visualisation was made

The symbiome visualisations were made as follows. Firstly DNA sequences were downloaded from EMBL and run through a script that extracted as much metadata as possible, including the contents of the host field (where present). I then took the NCBI taxonomy and generated an ordered list of taxa by walking the tree in postorder, which determines where on the circumference of the circle the taxon lies. Pairs of taxa in an association are connected by a quadratic Bezier curve. The illustration was created using SVG.
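The core of the layout is only a few lines. A minimal sketch, assuming the postorder index of each taxon is already known (the coordinates and the sample indices are arbitrary):

```python
import math

# Each taxon's postorder index fixes its angle on the circle; an
# association becomes a quadratic Bezier curve whose control point is
# the circle's centre, pulling the curve inward.
CX, CY, R = 250.0, 250.0, 200.0

def point_on_circle(index, n_taxa):
    theta = 2.0 * math.pi * index / n_taxa
    return CX + R * math.cos(theta), CY + R * math.sin(theta)

def association_path(i, j, n_taxa):
    x1, y1 = point_on_circle(i, n_taxa)
    x2, y2 = point_on_circle(j, n_taxa)
    return (f'<path d="M {x1:.1f},{y1:.1f} '
            f'Q {CX},{CY} {x2:.1f},{y2:.1f}" '
            'fill="none" stroke="black" stroke-width="0.5"/>')

svg = ['<svg xmlns="http://www.w3.org/2000/svg" width="500" height="500">']
svg.append(association_path(10, 900, 1000))  # a made-up host/parasite pair
svg.append('</svg>')
print("\n".join(svg))
```

Using the centre as the control point is the simplest choice; pulling the control point only part-way in would give the gentler arcs seen in tools like Circos.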


Next steps
There are several ways this visualisation could be improved. It's based on only a subset of data (I haven't run all of the sequence databases through the parser yet), and the matching of host taxa is based on exact string matching. All manner of weird and wonderful things get entered in the host field, so we'll need some more sophisticated parsing (see "LINNAEUS: A species name identification system for biomedical literature" doi:10.1186/1471-2105-11-85 for a more general discussion of this issue).

The visualisation is fairly crude at this stage. Circle plots like this are fairly simple to create, and pop up in all sorts of situations (e.g., RNA secondary structure methods, which I did some work on years ago). Of course, Circos would be an obvious tool to use to create the visualisations, but the overhead of installing it and learning how to use it meant I took a shortcut and wrote some SVG from scratch.

Although I've focussed on GenBank as a source of data, this visualisation could also be applied to other data. I briefly touched on this in Tag trees: displaying the taxonomy of names in BHL, where a page in the Biodiversity Heritage Library contains the names of a flea and its mammalian hosts. I think these circle plots would be a great way to highlight possible ecological associations mentioned in a text.

Thursday, March 24, 2011

TreeBASE meets NCBI, again

Déjà vu is a scary thing. Four years ago I released a mapping between names in TreeBASE and other databases called TBMap (described here: doi:10.1186/1471-2105-8-158). Today I find myself releasing yet another mapping, as part of my NCBI to Wikipedia project. By embedding the mapping in a wiki it can be edited, so the kinds of problems I encountered with TBMap, recounted here, here, and here, can be fixed. The mapping in and of itself isn't terribly exciting, but it's the starting point for some things I want to do regarding how to visualise the data in TreeBASE.

Because TreeBASE 2 has issued new identifiers for its taxa (see TreeBASE II makes me pull my hair out), and now contains its own mapping to the NCBI taxonomy, as a first pass I've taken their mapping and added it to http://iphylo.org/linkout. I've also added some obvious mappings that TreeBASE has missed. There are a lot more taxa which could be added, but this is a start.

The TreeBASE taxa that have a mapping each get their own page with a URL of the form http://iphylo.org/linkout/<TreeBase taxon identifier>, e.g. http://iphylo.org/linkout/TB2:Tl257333. This page simply gives the name of the taxon in TreeBASE and the corresponding NCBI taxon id. It uses a Semantic Mediawiki template to generate a statement that the TreeBASE and NCBI taxa are a "close match". If you go to the corresponding page in the wiki for the NCBI taxon (e.g., http://iphylo.org/linkout/Ncbi:448631) you will see any corresponding TreeBASE taxa listed there. If a mapping is erroneous, we simply need to edit the TreeBASE taxon page in the wiki to fix it. Nice and simple.

At the time of writing the initial mapping is still being loaded (this can take a while). I'll update this post when the uploading has finished.

Sunday, March 20, 2011

VIZBI 2011

I've spent the last three days at VIZBI, a Workshop on Visualizing Biological Data, held at the Broad Institute in Boston (note that "Broad" rhymes with "Code"). A great conference in a special venue that includes the DNAtrium. Videos of the talks will be online "real soon now"; look for the keynotes, which were full of great ideas and visualisations. To get a flavour of the meeting search for the hashtag #vizbi on Twitter (you can also see the tweet stream on the VIZBI home page). All the keynotes were great, but I personally found Tamara Munzner's the most enlightening. She drew on lots of research in visual perception to outline what works and what doesn't when presenting information visually. You can grab a PDF of her presentation here.

One aspect of the meeting which worked really well was the poster presentations. Poster sessions were held during coffee breaks, and after the last talk of the session but before the audience broke for coffee, each author of a poster got 90 seconds to introduce their poster (there were typically around 10 posters per break). This meant the poster authors got a chance to introduce themselves and their work to the workshop audience, and the audience could discover what posters were being displayed. Neat idea.

I gave a presentation on phylogenies, which I've put on slideshare. After explaining that I thought phylogeny visualisation was mostly a solved problem (as evidenced by the large number of tree viewers available), I continued the theme of why I don't think 3D works for phylogeny (except for geophylogenies), made the pitch for building a phylogeny viewer on the iPad, and finished with my recent work on Google Maps-style viewing of very large trees.

Friday, March 11, 2011

Geography and genes: zoomable view of frog NCBI classification with linked map

More zoom viewer experiments (see previous post), this time with a linked map that updates as you browse the tree (SVG-capable browser required). As you browse the frog classification the map updates to show the location of georeferenced sequences in GenBank from the taxa in the part of the tree you are looking at. The map is limited to not more than 200 localities, and many frog sequences aren't georeferenced, but it's a fun way to combine classification and geography. You can try it at:

http://iphylo.org/~rpage/deeptree/7.html

or watch the video:

Tuesday, March 08, 2011

The Mendeley API Binary Battle - win $US 10,001

Now we'll bring the awesome. Mendeley have announced The Mendeley API Binary Battle, with a first prize of $US 10,001, and some very high-profile judges (Juan Enriquez, Tim O'Reilly, James Powell, Werner Vogels, and John Wilbanks). Deadline for submission is August 31st 2011, with the results announced in October.

The criteria for judging are:
  1. How active is your application? We’ll look at your API key usage.

  2. How viral is the app? We’ll look at the number of sign ups on Mendeley and/or your application, and we’ll also have an eye on Twitter.

  3. Does the application increase collaboration and/or transparency? We’ll look at how much your application contributes to making science more open.

  4. How cool is your app? Does it make our jaws drop? Is it the most fun that you can have with your pants on? Is it making use of Facebook, Twitter, etc.?

  5. The Binary Battle is open to apps built prior to this announcement.


Start your engines...

Monday, March 07, 2011

Nomenclator Zoologicus meets Biodiversity Heritage Library: linking names directly to literature

Following on from my previous post on microcitations I've blasted all the citations in Nomenclator Zoologicus through my microcitation service and created a simple web site where these results can be browsed.

The web site is here: http://iphylo.org/~rpage/nz/.

To create it I've taken a file dump of Nomenclator Zoologicus provided by Dave Remsen and run all the citations through the microcitation service, storing the results in a simple database. You can search by genus name, author and year, or publication. The search is pretty crude, and in the case of publications can be a bit hit and miss. Citations in Nomenclator Zoologicus are stored as strings, so I've used some crude rules to try and extract the publication name from the rest of the details (such as page numbering).
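The crude rules themselves aren't shown in this post, but the flavour of the problem can be sketched with a regular expression for the common "journal, (series) volume, page" shape of these citations (the pattern and function below are illustrative, not the code the site actually uses):

```python
import re

# Citations in Nomenclator Zoologicus mostly follow the shape
# "journal, (series) volume, page", e.g. "Ann. Mag. nat. Hist., (8) 2, 72."
PATTERN = re.compile(
    r'^(?P<journal>.+?),\s*'        # journal abbreviation, up to the first comma
    r'(?:\((?P<series>\d+)\)\s*)?'  # optional series in parentheses
    r'(?P<volume>\d+),\s*'          # volume
    r'(?P<page>\d+)\.?$'            # starting page, optional trailing full stop
)

def parse_microcitation(text):
    m = PATTERN.match(text.strip())
    if m is None:
        return None
    # drop the series key when it's absent
    return {k: v for k, v in m.groupdict().items() if v is not None}

parse_microcitation("Ann. Mag. nat. Hist., (8) 2, 72.")
# -> {'journal': 'Ann. Mag. nat. Hist.', 'series': '8', 'volume': '2', 'page': '72'}
```

Real citations stray from this shape often enough (multiple pages, embedded comments) that any single pattern will miss a good fraction of them, which is exactly the failure mode described below.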

To get started, you can look at names published by Distant in 1910, which you can see below:

Nz1

If the citation has been found you can click on the icon to view the page in a popup, like this:

Nz2

You can also click on the page number to be taken to that page in BHL.


I've also added some other links, such as to the name in the Index to Organism Names, as well as bibliographic identifiers such as DOIs, Handles, and links to JSTOR and CiNii.

So far only 10% of Nomenclator Zoologicus records have a match in BHL, which is slightly depressing. Browsing through there are some obvious gaps where my parser clearly failed, typically where multiple pages are included in the citation, or the citation has some additional comments. These could be fixed. There are also cases where the OCR text is so mangled that a match has been rejected because the genus name and text were too different.

This has been hastily assembled, but it's one vision of a simple service where we can go from a genus name to seeing the original publication of that name. There are other things we could do with this mapping, such as enabling BHL to tell users that the reference they are looking at is the original source of a particular name, and enabling services that use BHL content (such as EOL and the Atlas of Living Australia) to flag which reference in BHL is the one that matters in terms of nomenclature.

Thursday, March 03, 2011

Microcitations: linking nomenclators to BHL

One of the challenges of linking databases of taxonomic names to the primary literature is the minimal citation style used by nomenclators (see my earlier post Nomenclators + digitised literature = fail).

For example, consider Nomenclator Zoologicus. Volumes 1-10 of this list of generic names in zoology were digitised in 2004 and put online by uBio (for more details of this project see Taxonomic informatics tools for the electronic Nomenclator Zoologicus, pmid:16501061). In Nomenclator Zoologicus the citation for the genus Abana is:

Ann. Mag. nat. Hist., (8) 2, 72.

The challenge is to link this short citation to the digital version of the corresponding article. I've been sitting on a copy of the digitised Nomenclator Zoologicus kindly provided by Dave Remsen, and I've finally started to look at the problem of mining it for links to databases such as BHL.

You can see the first attempt at http://biostor.org/microcitation.php. This form takes a genus name and the short citation and attempts to locate the corresponding page in BHL. It then checks whether the name is present on that page. Locating a page in a journal can be a challenge given the often rather ropey metadata in BHL, but BioStor uses a combination of fuzzy string matching and crude kludges to find the best match. But a further complication is that OCR errors may mean the taxonomic name we are looking for might not be detected on the page.

For example, if we search for the citation for the genus Aethriscus, Ann. Mag. nat. Hist., (7) 10, 329., we find two candidate pages in the journal Ann. Mag. nat. Hist., but neither contains the string "Aethriscus". However, if we use approximate string matching we find the OCR text for one page has the string "thriscus". This differs by only two characters from "Aethriscus", and so is a possible match (shown in orange).

2.png

Looking at the scanned page we can see the likely source of the problem:
3.png

In the original publication the name Aethriscus was written as Æthriscus. The ligature Æ has been corrupted by the OCR engine, and in Nomenclator Zoologicus the name is written without the ligature, hence the failure to exactly match the name with the text. These are some of the challenges faced when trying to close the circle and link names to literature.
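The approximate matching step can be sketched with a plain Levenshtein edit distance (a minimal version of my own; BioStor's actual matcher isn't shown here):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def fuzzy_match(name, ocr_word, max_edits=2):
    """Accept an OCR word as a possible rendering of `name` if the two
    are at most `max_edits` single-character edits apart."""
    return edit_distance(name.lower(), ocr_word.lower()) <= max_edits

fuzzy_match("Aethriscus", "thriscus")  # True: the words differ by two characters
```

The threshold is the delicate part: too strict and ligature damage like Æ slips through the net, too loose and unrelated names start matching.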

The microcitation parser is still pretty crude, but usable. You can get results in either HTML or JSON, so the task of mapping microcitations to BHL pages can be automated. At present the name matching assumes you are looking at a single word (e.g., a genus), I need to extend it to handle binomials.

BioStor updates on Twitter

BioStor has had a Twitter account @biostor_org for a while, but it's not been active. I finally got around to hooking it up to BioStor, so that now every time an article is added to BioStor, the title of that article and its URL appear in the @biostor_org Twitter feed.



Activity on this feed will be variable, depending on whether articles are being added manually, or in bulk. But it's a handy way to keep tabs on the growing number of articles being harvested from the Biodiversity Heritage Library.

Tuesday, March 01, 2011

Zooming a large tree, now with thumbnails

Continuing experiments with a zoom viewer for large trees (see previous post), I've now made a demo where the labels are clickable. If the NCBI taxon has an equivalent page in Wikipedia the demo displays a link to that page (and, if present, a thumbnail image). Give it a try at

http://iphylo.org/~rpage/deeptree/3.html

or watch the short video clip below:

Mendeley, OpenURL, BioStor, and BHL

Mendeley has added a feature which makes it easier to use Mendeley with repositories such as BioStor and BHL. As announced in Get Full Text: Mendeley now works with your local library via OpenURL, you can now add OpenURL resolvers to your Mendeley account:
We’ve added a button to the catalog pages that will allow you to get the article from your library right in Mendeley. This feature will link you directly to the full text copy according to your institutional access rights.
Ironically, in the UK access to electronic articles from a university is pretty seamless via the UK Access Management Federation, so I don't need to add an OpenURL resolver to get full text for an article. But this new feature does enable another way to access articles in my BioStor repository. By adding the BioStor OpenURL resolver to your Mendeley account, you can search for articles from your Mendeley library in BioStor.

The Mendeley blog post explains how to set up an OpenURL resolver. Go to your Mendeley account and click on the My Account button in the upper right corner of the page, then select Account Details, then the Sharing/Importing tab, or just click here.

openurl_settings.jpg

Click on Add library manually, then enter the name of the resolver (e.g., "BioStor") and the URL http://biostor.org/openurl:

Snapshot 2011-03-01 07-37-20.png

If you view a reference in Mendeley, you will now see something like this:

Snapshot 2011-03-01 07-40-04.png

In addition to the DOI and the URL, this reference now displays a Find this paper at menu. Clicking on it shows the default services, together with any OpenURL resolvers you've added (in this case, BioStor):
Snapshot 2011-03-01 07-42-50.png

You can add multiple resolvers, so we could also add the BHL OpenURL resolver http://www.biodiversitylibrary.org/openurl, although finding articles isn't the BHL resolver's strong point.

Now, what would be very handy is if Mendeley were to complete the circle by providing their own OpenURL resolver, so that people could find articles in Mendeley from metadata such as article title, journal, volume, and starting page. The Mendeley API might be a way to implement this, although its search features lack the granularity needed.
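Building such an OpenURL request is straightforward. Here is a sketch using simple OpenURL 0.1-style keys (whether BioStor's resolver expects exactly these keys is an assumption of this sketch):

```python
from urllib.parse import urlencode

BIOSTOR = 'http://biostor.org/openurl'

def openurl_query(atitle, journal, volume, spage):
    """Describe a journal article with OpenURL 0.1-style
    key-value pairs and target the BioStor resolver."""
    params = {
        'genre': 'article',
        'atitle': atitle,    # article title
        'title': journal,    # journal title
        'volume': volume,
        'spage': spage,      # starting page
    }
    return BIOSTOR + '?' + urlencode(params)

url = openurl_query('Luminescent Mycena: new and noteworthy species',
                    'Mycologia', '102', '459')
```

This is exactly the metadata (article title, journal, volume, starting page) that a Mendeley-hosted resolver would need to accept.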

Monday, February 28, 2011

Live demo of zooming a large tree

After the teaser on Friday (see Deep zooming a large 2D tree) I've put a live demo of my experiments with viewing a large tree online at:

http://iphylo.org/~rpage/deeptree/

The first example (Experiment 1) is the NCBI classification for frogs:

This version displays internal node labels, leaf labels (as many as can be displayed at a given zoom level), and works in Safari, Firefox, and Internet Explorer 8. Obviously this is all pretty rough, but take it for a spin; I'd welcome any feedback.

Friday, February 25, 2011

Deep zooming a large 2D tree

Here's a quick demo of a 2D large tree viewer that I'm working on. The aim is to provide a simple way to view and navigate very large trees (such as the NCBI classification) in a web browser using just HTML and JavaScript. At the moment this is simply a viewer, but the goal is to add the ability to show "tracks" like a genome browser. For example, you could imagine columns appearing to the right of the tree showing you whether there are phylogenies available for these taxa in TreeBASE, images from Wikipedia, sparklines for sequencing activity over time, etc.

I'll blog some more on the implementation details when I get the chance, but it's pretty straightforward. Image tiles are generated from SVG images of the tree using ImageMagick, labelling is applied on the fly using GIS-style queries to a MySQL database that holds the "world coordinates" of the nodes in the tree (see the discussion of world coordinates on Google's Maps API pages), and the zooming and tile fetching is based on Michal Migurski's Giant-Ass Image Viewer. Once I've tidied up a few things I'll put up a live demo so people can play with it.
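The viewer's own code isn't shown here, but the tiling arithmetic shared by all Google Maps-style viewers is easy to sketch: scale a point's "world coordinates" (its pixel position at zoom level 0) up to the current zoom level, then divide by the tile size to find which tile to fetch:

```python
TILE_SIZE = 256  # pixels, as in the Google Maps API

def world_to_tile(x, y, zoom):
    """Scale world coordinates (pixel position at zoom level 0) up to
    the given zoom, then find the 256-pixel tile containing the point."""
    scale = 2 ** zoom
    px, py = x * scale, y * scale
    return int(px // TILE_SIZE), int(py // TILE_SIZE)

world_to_tile(100.0, 200.0, 2)  # -> (1, 3)
```

The same calculation, run in reverse over a tile's bounding box, is what drives the GIS-style queries for which node labels fall inside a given tile.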

Thursday, February 24, 2011

Why 3D phylogeny viewers don't work

Matt Yoder (@mjyoder) and I had a Twitter conversation yesterday about phylogeny viewers, prompted by my tweeting about my latest displacement activity, a 2D tree browser using the tiling approach made popular by Google Maps.

As part of that conversation, Matt tweeted:
RT @rdmpage: @mjyoder - I think 3D is the worse thing we could do, there's no natural mapping to 3D. <- meh, where's the imagination?

Well, Matt's imagination has gone into overdrive, and he's blogged about his ideas.

3d_tree_browsing.jpg


This issue deserves more exploration, but here are some quick thoughts. 3D has been used in a number of phylogeny browsers, such as Mike Sanderson's Paloverde, Walrus, and the Wellcome Trust's Tree of Life. I don't find any of them terribly successful, pretty as they may be. I think there are several problems with trees in general, and 3D versions in particular.

Trees aren't real
Trees aren't real in the same way that the physical world is (or even imagined physical worlds). Trees are conceptual structures. The history of web interfaces is littered with attempts to visualise conceptual space, for example to summarise search results. These have been failures; a simple top-ten list, as used by Google, wins. I don't think this is because Google's designers lack imagination, it's because it works. Furthermore, this is actually a very successful visualisation:


I think elaborate attempts to depict conceptual spaces on screens are mostly going to fail.

Trees are empty
Compared to, say, a geographic map, trees are largely empty space. In a map every pixel counts, in that it potentially represents something. Think of the satellite view in Google Maps. Each pixel on the screen has information. Trees are largely empty, hence much of the display space is wasted. Moving trees to 3D just gives us more space to waste.

Trees don't have a natural ordering
Even if we accept that trees are useful visualisations, they have problems. Given the tree ((1,2),(3,4)); we have a lot of (perhaps too much) freedom in how we can depict that tree. For example, both diagrams below depict this tree. In the x-axis there is a partial order of internal nodes (the ancestor of {1,2} must be to the right of the ancestor of {1,2,3,4}), but the tree ((1,2),(3,4)); says nothing about the relative ordering of {1,2} versus {3,4}. We are free to choose. A natural linear ordering would be divergence time, but estimates of those times can be contested, or unavailable.

order.png


Phylogenies are unordered trees in the sense that I can rotate any node about its ancestor and still have the same tree (compare the two trees above). Phylogenies are like mobiles:


The practical consequence of this is that different tree viewers can render the same tree in very different ways, making navigation across viewers unpredictable. Compare this to maps. Even if I use different projections, the maps remain recognisably similar, and most maps retain similar relationships between areas. If I look at a map of Glasgow and move left I will end up in the Atlantic Ocean, no matter if I use Google Maps or Microsoft Maps. Furthermore, trees grow in a way that maps don't (at least, not much). If I add nodes to a tree it may radically change shape, destroying navigation cues that I may have relied on before. Typically maps change by the addition of layers, not by moving bits around (paleogeographic maps excepted).
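One way to make the "same tree, different drawings" point concrete is to compare canonical forms: sort each node's children recursively, and any two renderings that differ only by rotations collapse to the same structure (the nested-tuple representation here is just for illustration):

```python
def canonical(tree):
    """Canonical form of an unordered tree given as nested tuples
    (leaves are strings): sort every node's children, so two drawings
    that differ only by rotations compare equal."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))

t1 = (("1", "2"), ("3", "4"))
t2 = (("4", "3"), ("2", "1"))  # the same tree with both internal nodes rotated
canonical(t1) == canonical(t2)  # True
```

A tree viewer could use a canonicalisation like this to render trees consistently, but nothing forces different viewers to agree on the same canonical order, which is precisely the navigation problem.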

Trees aren't 3D
There's nothing intrinsically 3D about trees, which means any mapping to 3D space is going to be arbitrary. Indeed, most 3D viewers simply avoid any mapping and show a 2D tree in 3D space, which seems rather pointless.

Perhaps it's because I don't play computer games much (went through an Angry Birds phase, and occasionally pick up an X-Box controller, only to be mercilessly slaughtered by my son), but I'm not inspired by the analogy with computer games. I'm not denying that there are useful things to learn from games (I'm sure the controls in Google Earth owe something to games). But games also rely on a visceral connection with the play, and an understanding of the visual vocabulary (how to unlock treasure, etc.). Matt's 3D model requires users to learn a whole visual vocabulary, much of which (e.g., "Fruit on your tree? Someone has left comment(s) or feedback. ") seems forced.

My sense is that the most successful interfaces make minimal demands on users, don't fight their intuition, and don't force them to accept a particular visualisation of their own cognitive space.

I'll write more about this once I get my 2D tree viewer into shape where it can be shown. It will be a lot less imaginative than Matt's vision, all I'm shooting for is that it is usable.




Friday, February 18, 2011

Why metadata matters

Quick note to express the frustration I experience sometimes when dealing with taxonomic literature. As part of a frankly Quixotic desire to link every article cited in the Australian Faunal Directory (AFD) to the equivalent online resource (for example, in the Biodiversity Heritage Library using BioStor, or to a publisher web site using a DOI) I sometimes come across references that I should be able to find yet can't. Often it turns out that the metadata for the article is incorrect. For example, take this reference:
Report upon the Stomatopod crustaceans obtained by P.W. Basset-Smith Esq., surgeon R.N. during the cruise, in the Australia and China Sea, of H.M.S. "Penguin", commander W.V. Moore. Ann. Mag. Nat. Hist. Vol. 6 pp. 473-479 pl. 20B
which is in the Australian Faunal Directory (urn:lsid:biodiversity.org.au:afd.publication:087892ae-2134-4bb4-83ae-8b8cbd15b299). Using my OpenURL resolver in BioStor I failed to locate this article. Sometimes this is because the code I used to parse references from AFD mangles the reference, but not in this case. So, I Google the title and find a page in the Zoological catalogue of Australia: Aplacophora, Polyplacophora, Scaphopoda:


Here's the relevant part of this page:
Zoocat
Same as AFD, Ann. Mag. Nat. Hist. volume 6, pages 473-479, 1893.

In despair I looked at the BHL page for The Annals and Magazine of Natural History and discovered that there is no volume 6 published in 1893. There is, however, a series 6. Oops! Browsing the BHL content I discovered the start of the article I'm looking for on BHL page 27734740, in volume 11 of series 6 of The Annals and Magazine of Natural History. Gotcha! So I can now link AFD to BHL like this.

I should stress that in general AFD is a great resource for someone like me trying to link names to literature and, to be fair, with its reuse of volume numbers across series, The Annals and Magazine of Natural History can be a challenge to cite. Usually the bibliographic details in AFD are accurate enough to locate articles in BHL or CrossRef, but every so often references get mangled, misinterpreted, or someone couldn't resist adding a few "helpful" notes to a field in the database, resulting in my parser failing. What is slightly alarming is how often, when I Google for the reference, I find the same erroneous metadata repeated across several articles. This, coupled with the inevitable citation mutations, can make life a little tricky. The bulk of the links I'm making are constructed automatically, but there are a few cases where one is led on a wild goose chase to find the actual reference.

Although this is an example of why it matters to have accurate metadata, it can also be seen as an argument for using identifiers rather than metadata. If these references had stable, persistent identifiers (such as DOIs) that taxonomic databases cited, then we wouldn't need detailed metadata, and we could avoid the pain of rummaging around in digital archives trying to make sense of what the author meant to cite. Until taxonomic databases routinely use identifiers for literature, names and literature will be as ships that pass in the night.

Sunday, February 06, 2011

Why is the Atlas of Living Australia invisible to Google?

Jeff Atwood, one of the co-founders of Stack Overflow, recently wrote a blog post, Trouble In the House of Google, where he noted that several sites that scrape Stack Overflow content (which Stack Overflow's CC-BY-SA license permits) appear higher in Google's search rankings than the original Stack Overflow pages. When Stack Overflow chose the CC-BY-SA license they made the assumption that:
...that we, as the canonical source for the original questions and answers, would always rank first...That's why Joel Spolsky and I were confident in sharing content back to the community with almost no reservations – because Google mercilessly penalizes sites that attempt to game the system by unfairly profiting on copied content.
Jeff Atwood's post goes on to argue that something is wrong with the way Google is ranking sites that derive content from other sites.

I was reminded of this post when I started to notice that searches for fairly obscure Australian animals would often return my own web site Australian Faunal Directory on CouchDB as the first hit. In one sense this is personally gratifying, but it can also be frustrating because when I Google these obscure taxa it's usually because I'm trying to find data that isn't already in one of my projects.

unotata.pic1.JPG

But what I've also noticed is that the site that I obtained the data from, the Australian Faunal Directory (AFD), rarely appears in the Google search results. In fact, there are taxa for which Google doesn't find the corresponding page in AFD. For example, if you search for Uxantis notata (shown here in an image from the Key to the planthoppers of Australia and New Zealand) the first hit(s) are from my version of AFD:
Snapshot 2011-02-06 14-05-44.png


Neither the original AFD, nor the Atlas of Living Australia (ALA), which also builds on AFD, appear in the top 10 hits.

Initially I thought this was probably an artefact. This is a pretty obscure taxon; maybe things like rounding error in computing PageRank are going to affect search rankings more than anything else. However, if I explicitly tell Google to search for Uxantis notata in the domain environment.gov.au I get no hits whatsoever:

Snapshot 2011-02-06 14-10-32.png

Likewise, the same search restricted to ala.org.au finds nothing, nothing at all. Both AFD and Atlas of Living Australia have pages for this taxon, here, and here, so clearly something is deeply wrong.

Why are the original providers of the data not appearing in Google search results at all? For someone like me who argues that sharing data is a good thing, and sites that aggregate and repurpose data will ultimately benefit the original data providers (for example by sending traffic and Google Juice) this is somewhat worrying. It seems to reinforce the fear that many data providers have: "if I share my data someone will make a better web site than mine and people will go to that web site, rather than the one I've created with my hard-won data." It may well be that data aggregators will score higher than data providers in Google searches, but I hadn't expected data providers to be virtually invisible.

atlasaustraliasm.gif

Google isn't the problem
If a web site that I hacked together in a few days does better in Google searches than the rather richer pages published by sites such as ALA (with a budget of over $AU 30 million), something is wrong. Unlike the Stack Overflow example discussed above, I don't think the problem here is with Google.
If we search in Google for an "iconic" Australian taxon by name, say the Koala Phascolarctos cinereus, Wikipedia is the first hit (which should be no surprise). ALA doesn't appear in the top ten. If we tell Google to just search the domain ala.org.au we get lots of pages from ALA, but not the actual species page for Phascolarctos cinereus. This suggests that there is something about the way ALA's website works that prevents Google indexing it properly. I'm also a little worried that a major biodiversity project which has as its aim
...to improve access to essential information on Australia’s biodiversity
is effectively invisible to Google.



Friday, February 04, 2011

Web Hooks and OpenURL: the screencast

Yesterday I posted notes on Web Hooks and OpenURL. That post was written when I was already late (you know, when you say to yourself "yeah, I've got time, it'll just take 5 minutes to finish this..."). The Web Hooks + OpenURL project is still very much a work in progress, but I thought a screencast would help explain why I think this is going to make my life a lot easier. It shows an example where I look at a bibliographic record in one database (AFD, the Australian Faunal Directory on CouchDB), click on a link that takes me to BioStor — where I can find the reference in BHL — then simply click on a button on the BioStor page to "automagically" update the AFD database. The "magic" is the Web Hook. The link I click on in the AFD database contains the identifier for that entry in the AFD, as well as a URL BioStor can call when it's found the reference (that URL is the "web hook").

Using Web Hooks and OpenURL from Roderic Page on Vimeo.



Thursday, February 03, 2011

Web Hooks and OpenURL: making databases editable

For me one of the most frustrating things about online databases is that they often can't be edited. For example, I've recently created a version of the Australian Faunal Directory on CouchDB, which contains a list of all animals in Australia, and a fairly comprehensive bibliography of taxonomic publications on those animals. What I'd like to do is locate those publications online. Using various scripts I've found DOIs for some 2,500 articles, located nearly 4,900 articles in BHL, and added these to the database, but browsing the database (using, say, the quantum treemap interface) makes it clear there are lots of publications that I've missed.

It would be great if I could go to the Australian Faunal Directory on CouchDB and edit these on that site, but that would require making the data editable, and that means adding a user interface. And that's potentially a lot of work. Then, if I go to another database (say, my CouchDB version of the Catalogue of Life) and want to make that editable then I have to add an interface to that database as well. I could switch to using a wiki, which I've done for some projects (such as the NCBI to Wikipedia mapping), but wikis have their own issues (in particular, they don't easily support the kinds of queries I want to do).

There is, as they say, a third way: web hooks. I first came across web hooks when I discovered Post-Commit Web Hooks in Google Code. The idea is you can create a web service that gets called every time you commit code to the Google Code repository. For example, each time you commit code you can call a web hook that uses the Twitter API to tweet details of what you just committed (I tried this for a while, until some of my Twitter followers got seriously pissed off by the volume of tweets this was generating).

What has this to do with making databases editable? Well, imagine the following scenario. A web page displays a publication, but no DOI. However, the web page embeds an OpenURL in the form of a COinS (in other words, a URL with key-value pairs describing the publication). If you use a tool such as the OpenURL Referrer in Firefox you can use an OpenURL resolver to find that publication. Examples of OpenURL resolvers include bioGUID and BioStor. Let's say you find the publication, and it has a DOI. How do you tell the database about this? Well, you can try and find an email address of someone running the database so you can send them the information, but this is a hassle. What if the OpenURL resolver that you used to find the DOI could automatically tell the source database that it's found the DOI? That's the idea behind web hooks.

I've started to experiment with this, and have most of the pieces working. Publication pages in Australian Faunal Directory on CouchDB have COinS that include two additional pieces of information: (1) the database identifier for the publication (in this case a UUID; in the hideously complex jargon of OpenURL this is the "Referring Entity Identifier"), and (2) the URL of the web hook. The idea is that an OpenURL resolver can take the OpenURL and try and locate the article. If it succeeds it will call the web hook URL supplied by the database, telling it "hey, I've found this DOI for the publication with this database identifier". The database can then update its data, so the next time a user visits the page for that publication in the database, the user will see the DOI. This has one huge advantage over tools that just modify the web page on the fly (such as David Shorthouse's reference parser): persistence. The database itself is updated, not just the web page.

In order to make this work, all the database needs to do is have a web hook, namely a URL that accepts POST requests. The heavy lifting of searching for the publication, or enabling users to correct and edit the data, can be devolved to a single place, namely the OpenURL resolver. As a first step I'm building an OpenURL resolver that displays a form in which the user can edit bibliographic details, and launch searches in CrossRef (and soon BioStor). When the user is done they can close the form, which is when it calls the web hook with the edited data. The database can then choose to accept or reject the update.

Given that it's easy to create the web hook, and trivial to get a database to output an OpenURL with its internal identifier and the URL of the web hook, this seems like a light-weight way of making databases editable.

Tuesday, January 18, 2011

Quantum treemaps meet BHL and the Australian Faunal Directory

One of the things I'm enjoying about the Australian Faunal Directory on CouchDB is the chance to play with some ideas without worrying about breaking lots of code or, indeed, upsetting any users ('cos, let's face it, there aren't any). As a result, I can start to play with ideas that may one day find their way into other projects.

One of these ideas is to use quantum treemaps to display an author's publications. For example, below is a treemap showing publications by G A Boulenger in my Australian Faunal Directory on CouchDB project. The publications are clustered by journal. If a publication has been found in BioStor the treemap displays a thumbnail of that publication, otherwise it shows a white rectangle. At a glance we can see where the gaps are. You can view a publication's details simply by clicking on it.

boulenger.png

The entomologist W L Distant has a more impressive treemap, and clearly I need to find quite a few of his publications.
distant.png
I quite like the look of these, so may think about adding this display to BioStor. I may also think about using treemaps in my ongoing iPad projects. If you want to see where I'm going with this then take a look at Good et al. A fluid treemap interface for personal digital libraries.

Notes
The quantum treemap is computed using some rather ugly PHP I wrote, based on this Java code. I've not implemented all the refinements of the original Java code, so the quantum treemaps I create are sometimes suboptimal. To avoid too much visual clutter I haven't drawn a border around each cell; instead I use CSS gradients to indicate the area of the cell (if you're using Internet Explorer the gradient will be vertical rather than going from top left to bottom right). The journal name is overlain on the cell contents, but if you are using a decent browser (i.e., not Internet Explorer) you can still click through this text to the underlying thumbnail because the text uses the CSS property
.overlay { pointer-events: none; }
I learnt this trick from the Stack Overflow question Click through div with an alpha channel.
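For readers curious about the layout itself, the core idea of a quantum treemap (cells of a fixed, identical size, so thumbnails never distort) can be sketched as choosing the grid whose aspect ratio best fits the enclosing box. This is a toy version only; the published algorithm adds pivoting and the refinements mentioned above:

```python
import math

def quantum_grid(n, box_w, box_h, cell_w, cell_h):
    """Choose a columns x rows grid of fixed-size cells for n items,
    picking the column count whose grid aspect ratio is closest to
    that of the enclosing box."""
    best = None
    for cols in range(1, n + 1):
        rows = math.ceil(n / cols)
        score = abs((cols * cell_w) / (rows * cell_h) - box_w / box_h)
        if best is None or score < best[0]:
            best = (score, cols, rows)
    return best[1], best[2]

quantum_grid(10, 400, 300, 40, 60)  # -> (5, 2): five thumbnails across, two rows
```

Because cell sizes never change, every thumbnail in the layout stays legible, which is the whole point of the "quantum" constraint.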

Friday, January 14, 2011

The demise of phthiraptera.org and the perils of using Internet domain names as identifiers

When otherwise sensible technorati refer to "owning" a domain name, it makes me want to stick forks in my eyeballs. We do not "own" domain names. At best, we only lease them and there are manifold ways in which we could lose control of a domain name - through litigation, through forgetfulness, through poverty, through voluntary transfer, etc. Once you don't control a domain name anymore, then you can't control your domain-name-based persistent identifiers either. - Geoffrey Bilder interviewed by Martin Fenner
Geoffrey Bilder's comments about the unsuitability of URLs as long-term identifiers (as opposed, say, to DOIs) came to mind when I discovered that the domain phthiraptera.org is up for sale:

Snapshot 2011-01-14 07-47-39.png

This domain used to be home to a wealth of resources on lice (order Phthiraptera). I discovered that ownership of the domain had expired when a bunch of links to PDFs returned by an iSpecies search for Collodennyus all bounced to the holding page above. Phthiraptera.org was owned by the late Bob Dalgleish. After his death, ownership of the domain lapsed, and it's now up for sale. Although much of the content of Phthiraptera.org has been moved to phthiraptera.info, URLs containing phthiraptera.org still turn up in search results, especially ones that have been cached (for example, in iSpecies). Given that much of the content is still available the loss isn't total, but anyone relying on links containing phthiraptera.org to point to content (such as a PDF), or to identify that content (such as a publication) will find themselves in trouble. Although ideally Cool URIs don't change, in practice they do, and with alarming frequency. Furthermore, in this case, because ownership of phthiraptera.org has lapsed, there's no opportunity to create redirects from URLs with phthiraptera.org to the equivalent content in phthiraptera.info (leaving aside the issue that phthiraptera.info is not a mirror of phthiraptera.org, so exactly what the redirects would point to is unclear).

Identifiers based on domain names, such as URLs and LSIDs, are attractive because the DNS helps ensure global uniqueness, and HTTP provides a way to resolve the identifier, but all this is contingent on the domain itself persisting. For more on this topic I recommend reading Martin Fenner's interview of CrossRef's Geoffrey Bilder, from which I took the opening quote.

Tuesday, January 11, 2011

Why won't The Plant List let me do this?

In my last post I discussed why I thought the decision of The Plant List to use a restrictive license (CC-BY-NC-ND) was such a poor choice. CC-BY-NC-ND states that
You may not alter, transform, or build upon this work.
To make this point more concrete, I've created this site:

Experiments with The Plant List

to show the kinds of things that The Plant List's choice of license prevents the taxonomic community from doing. As a first step I'm exploring linking the names in the list to the primary scientific literature, as this video demonstrates:

The Plant List from Roderic Page on Vimeo.


For example, we can take a name like Begonia zhengyiana Y.M.Shui, parse the bibliographic citation provided by The Plant List (via IPNI), and locate the actual paper online; in this case it's freely available as a PDF:



Now we can see a drawing of the plant, and instead of simply trusting that the compilers of The Plant List have correctly interpreted this paper, we can see for ourselves. Down the track, we could imagine mining this paper for details about the plant, such as its morphology and geographic distribution. This requires the link to the original literature, which The Plant List lacks.
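The citation-parsing step can be sketched in a few lines of Python. The citation string and the regular expression below assume a typical nomenclator shape ("Journal volume(issue): pages (year)"); real IPNI records won't always follow it, so treat this as illustrative rather than a robust parser:

```python
import re

# Assumes citations of the form "Journal vol(issue): pages (year)";
# the issue is optional, pages may be a range.
CITATION_RE = re.compile(
    r"^(?P<journal>.+?)\s+"
    r"(?P<volume>\d+)"
    r"(?:\((?P<issue>[^)]+)\))?"
    r":\s*(?P<pages>\d+(?:-\d+)?)"
    r"\.?\s*\((?P<year>\d{4})\)"
)

def parse_citation(citation):
    """Return the citation's parts as a dict, or None if it
    doesn't match the assumed shape."""
    m = CITATION_RE.match(citation)
    return m.groupdict() if m else None

# A made-up citation in the assumed shape:
print(parse_citation("Acta Bot. Yunnan. 24(3): 307 (2002)"))
```

With the journal, volume, and pages in hand, the next step is to look the article up in a digital library or a DOI registry.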

A good chunk of the recent plant taxonomic literature has DOIs, for example journals such as the Kew Bulletin and Novon. Playing with some scripts, I've managed to associate nearly 9000 accepted names with a DOI, and that's by looking at only a few journals. There are lots more DOIs to be found, but because of the way botanical nomenclators record references (see my post Nomenclators + digitised literature = fail) it can be something of a challenge to find them. This task isn't helped by the fairly lax way some publishers enter data in CrossRef (Cambridge University Press, I'm looking at you). The other obvious source of digitised literature is, of course, BHL, and that's next on the list of resources to play with.
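The DOI-matching step can be sketched against CrossRef's works endpoint, which accepts a free-text citation (query.bibliographic and rows are documented parameters of CrossRef's REST API). This only builds the query URL; actually fetching it, and deciding whether the top hit really is the cited paper, still needs a human sanity check:

```python
from urllib.parse import urlencode

def crossref_query_url(citation, rows=1):
    """Build a CrossRef works query for a free-text citation.
    The caller fetches the URL and inspects the top match(es);
    CrossRef scores are a ranking, not a guarantee of identity."""
    params = urlencode({"query.bibliographic": citation, "rows": rows})
    return "https://api.crossref.org/works?" + params

print(crossref_query_url("Begonia zhengyiana Acta Bot. Yunnan. 24: 307 2002"))
```

Running a batch of nomenclator citations through a query like this, then eyeballing the low-confidence matches, is roughly the workflow the scripts above follow.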

Experiments with The Plant List is very crude, and I've barely scratched the surface of linking names to primary literature. That said, given that there are exactly zero links between names and digital literature in The Plant List, I'd argue that my site adds value to the data in The Plant List. And that's my point: by making data available for others to play with, you enable others to add value to that data. By choosing a CC-BY-NC-ND license, The Plant List has killed that possibility.

So, my question for The Plant List is "why did you do that?"