Friday, August 06, 2010

Extracting semantic goodness from Zootaxa articles


I've just come back from a holiday in New Zealand, during which time I spent a morning chatting with Zhi-Qiang Zhang (@Zootaxa, editor of Zootaxa) and Stephen Thorpe (stho002, a major contributor to Wikispecies).

Fresh from playing with PLoS XML to explore ways of redisplaying articles (described in my commentary on the PLoS iPad app), I was extolling the virtues of the XML mark-up that underlies PLoS (and other Open Access journals, such as the BMC series). These publishers provide Open Access XML versions of their papers that are quite richly marked up: internal citations, links to figures, the bibliography, etc. are all clearly identified, although they don't have the semantic mark-up of TaxPub, used in some recent ZooKeys papers.

Talking to Zhi-Qiang Zhang is always a useful reality check. Zootaxa describes itself as the
"World's foremost journal in taxonomy; publisher of 15,421 new taxa in 141,518 pages by 7,385 authors worldwide since 2001"

This is taxonomic publishing on a grand scale, averaging more than an article a day. Since 2004 Zootaxa has published 12.60% of the new taxa recorded in Zoological Record, an order of magnitude more than its nearest rival. The journal is tightly run, and doesn't have cash to spare (it has nothing like the funding PLoS has, for example). Any change to the basic workflow (author submits a Word file, this is imported into Adobe FrameMaker, which creates the PDF files displayed on the Zootaxa web site) requires compelling justification. Furthermore, any change would have to scale. The level of work required to embellish articles using custom mark-up, such as TaxPub, just isn't feasible.

Zhi-Qiang waxed enthusiastic about Google Books' interface, where basic information such as keywords, geographic location, and references are extracted automatically. Google Books was one inspiration for the article display I use in BioStor, so I wondered how hard it would be to take some of the work I've been doing on BioStor and on adding mark-up to PLoS XML and apply it to Zootaxa PDFs. After some fussing with regular expressions, the bioGUID OpenURL resolver and uBio's FindIT taxonomic name tool, I now have some scripts that automate extracting basic information from a Zootaxa PDF, such as the abstract, localities, taxonomic names, GenBank sequences, and the bibliography. You can see some examples at http://iphylo.org/~rpage/zootaxa/. It's all a bit crude, and isn't the same as being able to mark up the actual text (which could be done, but with rather more effort), but there's potential here to create nice interfaces to Zootaxa papers, as well as extract the data needed to do some interesting queries.
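
To give a flavour of the regular expression side of this, here is a minimal Python sketch. It is not the actual scripts. It assumes the PDF has already been converted to plain text, that the abstract sits between an "Abstract" heading and a "Key words" line, and that GenBank accession numbers look like one letter plus five digits or two letters plus six digits. The function names are made up for illustration, and patterns this loose will pick up some false positives.

```python
import re

# Assumption: accession numbers are one letter + five digits,
# or two letters + six digits (e.g. U12345, GU123456).
ACCESSION_RE = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})\b")

# Assumption: the abstract runs from an "Abstract" heading
# up to the "Key words" line that follows it.
ABSTRACT_RE = re.compile(
    r"Abstract\s*(.*?)\s*Key\s*words", re.DOTALL | re.IGNORECASE
)


def extract_accessions(text):
    """Return accession-like strings in order of first appearance, without repeats."""
    seen = []
    for match in ACCESSION_RE.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


def extract_abstract(text):
    """Return the abstract as a single line of text, or None if no match is found."""
    match = ABSTRACT_RE.search(text)
    if match is None:
        return None
    # Collapse the line breaks left over from the PDF layout.
    return " ".join(match.group(1).split())
```

Localities, taxonomic names, and the bibliography need more than a couple of patterns, which is where the OpenURL resolver and the name-finding tool come in.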