Tuesday, June 08, 2010

Linking NCBI to Wikipedia

180px-Sphaerius.acaroides.Reitter.tafel64.jpgIn an earlier post I discussed linking NCBI taxonomy to Wikipedia. One way to tackle this is to add NCBI Taxonomy ID to Wikipedia pages. I reopened the case for adding the Taxonomy IDs to the Taxobox on each taxon page, but this met with substantial resistance. A modified proposal to add them elsewhere to the Wikipedia page seems to be gaining more support (or, at least, less vigorous resistance).

Meanwhile, there are other things that need to be done to linking NCBI and Wikipedia. One is to add Wikipedia page names to NCBI Linkout so that when viewing a NCBI taxon page you will see a link to Wikipedia if a page for the corresponding taxon exists. To create this linkout we need a mapping from NCBI to Wikipedia, and that's what I've been working on for the last few days.

The mapping is still in progress, but essentially I've taken a dump of the NCBI taxonomy for June 3, 2010, and matched the names with those in a the June 18, 2009 dump of Wikipedia that I've analysed elsewhere on this blog. I'll detail the various steps in the mapping elsewhere (there are issues such as synonyms, homonyms, Wikipedia redirects, etc.), but for now things seem to be working reasonably well.

The mapping is being created in a Semantic Mediawiki at http://iphylo.org/linkout/. When complete you will be able to up a NCBI taxon by either it's name (including synonyms and common names) or it's NCBI Taxonomy ID. Where possible I'm mapping the NCBI taxon to Wikipedia, and providing a snippet of text and an image.

I've also extracted bibliographic information from the citations.dmp file that comes with the NCBI dump. This contains the comments that you sometimes see on a taxon page. In a few cases I've added some information manually. For example, the beetle genus Sphaerius has a rather complicated nomenclatural history, which the NCBI page summarises as:
Due to a recent ruling (ICZN 2000), the family and generic names Sphaeriusidae Erichson, 1845, and Sphaerius Waltl, 1838, are both available names and have priority over Microsporidae Crotch, 1873 for the family name and Microsporus Kolenati, 1846 for the single included genus, respectively.

By looking through BioStor I've found some of the papers relating to this ICZN ruling, and added them to the wiki page http://iphylo.org/linkout/Ncbi:174920 (aficionados of zoological nomenclature may enjoy the complexity of the case, due to homonymy between the corresponding family name, Sphaeriidae, and a mollusc family of the same name).

Once thus mapping is complete, it will be time to think of how to get this into NCBI's Linkout, and also how to automatically update the mapping to reflect the growth of both the NCBI taxonomy and Wikipedia. If you visit http://iphylo.org/linkout/ please be aware that the mapping is still being written to the wiki (this is being done via API calls, and adding some 900,000 pages is going to take a while).