
Tuesday, May 28, 2019

Ozymandias meets Wikipedia, with notes on natural language generation

I've tweaked Ozymandias to now include short natural language summaries (snippets) for various taxa. This makes the output a little more friendly and informative. For example, here's a snippet from the page on Cephalodesmius, a dung beetle that makes its own dung.


These snippets come from Wikipedia (well, actually, from the DBpedia project). Behind the scenes I have a script that takes the GBIF taxon id for an ALA taxon (if it exists), queries Wikidata for the corresponding taxon and any associated identifiers of interest, and if there's a link to an English language Wikipedia page I do a quick SPARQL query to DBpedia to retrieve the snippet of text. At some point all of this could be sped up by adding the relevant data to the triple store and doing the query locally, but for now it works well enough.
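For the record, here is a minimal sketch of that lookup chain, assuming the public Wikidata and DBpedia SPARQL endpoints (the GBIF id passed in at the end is illustrative):

import requests

def wikipedia_title_for_gbif(gbif_id):
    # Wikidata property P846 holds the GBIF taxon id; the schema:about
    # pattern finds the sitelink to English Wikipedia.
    query = """
    SELECT ?article WHERE {
      ?item wdt:P846 "%s" .
      ?article schema:about ?item ;
               schema:isPartOf <https://en.wikipedia.org/> .
    }""" % gbif_id
    r = requests.get("https://query.wikidata.org/sparql",
                     params={"query": query, "format": "json"})
    hits = r.json()["results"]["bindings"]
    # e.g. https://en.wikipedia.org/wiki/Cephalodesmius -> "Cephalodesmius"
    return hits[0]["article"]["value"].rsplit("/", 1)[-1] if hits else None

def dbpedia_snippet(title):
    # DBpedia resource URIs reuse Wikipedia page titles as slugs.
    query = """
    SELECT ?abstract WHERE {
      <http://dbpedia.org/resource/%s> dbo:abstract ?abstract .
      FILTER(lang(?abstract) = "en")
    }""" % title
    r = requests.get("https://dbpedia.org/sparql",
                     params={"query": query, "format": "json"})
    hits = r.json()["results"]["bindings"]
    return hits[0]["abstract"]["value"] if hits else None

title = wikipedia_title_for_gbif("1093141")  # illustrative GBIF taxon id
if title:
    print(dbpedia_snippet(title))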

Of course, many snippets are little more than stubs, e.g. the snippet for another dung beetle genus Diorygopyx doesn't tell us much more than we can get from the information already displayed.



But having a text summary still seems worthwhile, which raises the question of what to do when Wikipedia doesn't know anything about a taxon. Obviously, we could start editing Wikipedia to flesh out its content, but that will take a while to filter into databases such as DBpedia. Another approach is to generate snippets from the triple store itself, in other words, to generate natural language summaries from structured data. For example, we could generate summaries such as "Diorygopyx is a genus of Scarabaeidae or scarab beetles in the superfamily Scarabaeoidea" fairly easily from the taxonomic hierarchy and a few common names.

But we could also do more. In browsing Ozymandias I'm struck at times by how much our knowledge of a taxon depends on a major piece of taxonomic work, often done some time ago. For example, Otte and Alexander's (1983) The Australian Crickets (Orthoptera: Gryllidae) (Academy of Natural Sciences of Philadelphia, Monograph 22) is a monumental taxonomic monograph, and many Australian cricket genera had most (or all) of their species described in that work. Imagine having a snippet that mentioned this (e.g., "Most species in this genus were described in 1983, and no species have been discovered since."). That would give the reader some useful information, and perhaps also prompt them to ask "so, why haven't any more species been described?".

I think there's scope here to make the output from triple stores (and other databases) more approachable using natural language generation. This is obviously a big area, and there are some very sophisticated approaches to generating natural-sounding language (think chatbots), perhaps the most striking example of which is Google Duplex.


But we don't need quite this level of sophistication; something using much simpler techniques (e.g., nalgene-js) would probably be enough. Armed with some basic facts from the triple store, and some simple templates, we could probably generate some useful text snippets for many taxa in Ozymandias, and indeed for other entities (see the sketch below). For example, David Shorthouse is outputting simple text summaries of the contributions of taxonomists to specimen collection and identification.

Imagine extending this to take into account publications, geography, etc. I think there's lots of scope here for moving beyond just displaying data and trying to generate human-friendly summaries of data.
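To make the template idea concrete, here is a toy sketch. nalgene-js is a JavaScript library, but the same fill-in-the-blanks approach looks like this in Python; the field names and the input dictionary are made up for illustration:

def taxon_snippet(taxon):
    # Build "X is a genus of F [or common name] in the superfamily S." from
    # whatever facts the triple store can supply.
    sentence = "%s is a genus of %s" % (taxon["name"], taxon["family"])
    if taxon.get("common_name"):
        sentence += " or %s" % taxon["common_name"]
    if taxon.get("superfamily"):
        sentence += " in the superfamily %s" % taxon["superfamily"]
    sentences = [sentence + "."]
    if taxon.get("monograph_year"):
        # Year of the monograph that described most of the genus's species.
        sentences.append(
            "Most species in this genus were described in %d, and no species have been discovered since."
            % taxon["monograph_year"])
    return " ".join(sentences)

print(taxon_snippet({
    "name": "Diorygopyx",
    "family": "Scarabaeidae",
    "common_name": "scarab beetles",
    "superfamily": "Scarabaeoidea",
}))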

Thursday, May 26, 2016

Thoughts on Wikipedia, Wikidata, and the Biodiversity Heritage Library

Given that Wikipedia, Wikidata, and the Biodiversity Heritage Library (BHL) all share the goal of making information free, open, and accessible, there seems to be a lot of potential for useful collaboration. Below I sketch out some ideas.

BHL as a source of references for Wikipedia

Wikipedia likes to have sources cited to support claims in its articles, and BHL has a lot of articles that could be cited by Wikipedia articles. By adding these links, Wikipedia users get access to further details on the topic of interest, while BHL benefits from the greater visibility that comes with visits from Wikipedia readers.

In the short term BHL could search Wikipedia for articles that could benefit from links to BHL (see below). In the long term, as more and more BHL articles get DOIs, this will become redundant, as Wikipedia authors will discover articles via CrossRef.

There are various ways to search Wikipedia to get a sense of what links could be added. For example, you can search the Wikipedia API for pages that link to a particular web domain (see https://www.mediawiki.org/wiki/API:Lists/All#Exturlusage). Here's a search for articles linking to biostor.org https://en.wikipedia.org/w/api.php?action=query&list=exturlusage&euquery=biostor.org&eulimit=20.
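Scripted, that query looks something like the sketch below (same exturlusage parameters as the URL above):

import requests

def pages_linking_to(domain, limit=20):
    # List Wikipedia pages whose external links mention the given domain.
    r = requests.get("https://en.wikipedia.org/w/api.php", params={
        "action": "query",
        "list": "exturlusage",
        "euquery": domain,
        "eulimit": limit,
        "format": "json",
    })
    return [page["title"] for page in r.json()["query"]["exturlusage"]]

for title in pages_linking_to("biostor.org"):
    print(title)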

A quick inspection suggests that many of these links could be improved (for example, some are outdated links to PDFs rather than to the article), so we can locate Wikipedia articles that could be edited. Wikipedia articles that already have one link to BHL or BioStor are also likely to have other citations that could be linked.

Wikipedia as a source of content

One of the big challenges facing BHL is extracting articles from its content. My own BioStor is one approach to tackling this problem. BioStor takes citation details for articles and attempts to locate them in BHL - the limiting factor is access to good-quality citation data. Wikipedia is potentially an untapped source of citation data. Each page that uses the "Cite" template could be mined for citations, which in turn could be used to locate articles. Wikipedia pages using the Cite template can be found via the API, e.g. https://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eititle=Template:Cite&eilimit=20&format=json. Alternatively, we could mine particular types of pages (e.g., those on taxa or taxonomists), or mine Wikispecies (which doesn't use the same citation formatting as Wikipedia).
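As a sketch of the mining step, here is one way to fetch a page's raw wikitext and pull out {{cite ...}} templates with a crude regular expression (serious template parsing would want a library such as mwparserfromhell, and nested templates will defeat this regex):

import re
import requests

def cite_templates(title):
    # Fetch the current wikitext of a page via the MediaWiki API.
    r = requests.get("https://en.wikipedia.org/w/api.php", params={
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": 2,
    })
    wikitext = r.json()["query"]["pages"][0]["revisions"][0]["slots"]["main"]["content"]
    # Non-greedy, non-nested match for {{cite ...}} templates.
    return re.findall(r"\{\{[Cc]ite[^{}]*\}\}", wikitext)

for cite in cite_templates("Komodo dragon"):
    print(cite)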

Wikidata as a data store

If Wikidata aims to be a repository of all structured data relevant to Wikipedia, then this includes bibliographic citations (see WikiCite 2016), hence many articles in BHL will end up in Wikidata. This has some interesting implications, because Wikidata can model data with more fidelity than many other sources of bibliographic information. For example, it supports multiple languages as well as multiple representations of the same language: the journal Acta Herpetologica Sinica https://www.wikidata.org/wiki/Q24159308 in Wikidata has not only the Chinese title (兩棲爬行動物學報) but the pinyin transliteration "Liangqi baxing dongwu yanjiu". Rather than attempt to replicate a community-editable database, Wikidata could be the place to manage article and journal-related metadata.
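As a quick illustration, this sketch pulls every label and alias for that journal item (Q24159308) from the Wikidata SPARQL endpoint, language tags included:

import requests

query = """
SELECT ?label WHERE {
  { wd:Q24159308 rdfs:label ?label }
  UNION
  { wd:Q24159308 skos:altLabel ?label }
}"""
r = requests.get("https://query.wikidata.org/sparql",
                 params={"query": query, "format": "json"})
for binding in r.json()["results"]["bindings"]:
    label = binding["label"]
    print(label.get("xml:lang", "?"), label["value"])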

Disambiguating people

As we move from "strings to things" we need to associate names for things with identifiers for those things. I've touched on this already in Possible project: mapping authors to Wikipedia entries using lists of published works. Ideally each author in BHL would be associated with a globally unique identifier, such as ORCID or ISNI. Contributors to Wikipedia and Wikidata have been collecting these for individuals with Wikipedia articles. If those Wikipedia pages have links to BHL content then we can semi-automate the process of linking people to identifiers.

Caveats

There are a couple of potential "gotchas" concerning Wikipedia and BHL. The licenses used for content are different: BHL content is typically CC BY-NC, whereas Wikipedia text is CC BY-SA. The "non commercial" restriction used by BHL is a deal-breaker for sharing content such as page images with Wikimedia Commons.

Wikipedia and Wikidata are communities, and I've often found this makes it challenging to work out how to get things done. Who do you contact to make a decision about some new feature you'd like to add? It's not at all obvious (unless you're a part of that community). Existing communities with accepted practices can be resistant to change, or may not be convinced that what you'd like to do is a benefit. For example, I think it would be great to have a Wikipedia page for each journal. Not everyone agrees with this, and one can expend a lot of energy debating the pros and cons. The last time I got seriously engaged with Wikipedia I ended up getting so frustrated I went off in a huff and built my own wiki. This is where a "Wikipedian in residence" might be helpful.

Friday, April 15, 2016

GBIF and impact: CrossRef, FundRef, and Altmetric

For anyone doing research or involved in scientific infrastructure, demonstrating the "impact" of those activities is becoming increasingly important. This has fostered a growth industry in "altmetrics": tools that track how research gathers attention outside academia (of course, we can argue whether attention is the same as impact).

For an organisation such as GBIF there's a clear need to show that it has impact on the field of biodiversity (and beyond), especially to its funders (which are ultimately national governments). To do this GBIF needs to track how its data is used by the research communities, both to do science and to inform policy. This is hard to do, especially if there's a limited culture of data citation. It occurs to me that another way to tackle this problem is to invert it by looking not at the impact of GBIF, but at GBIF as a source of impact.

For a moment let's replace GBIF with Wikipedia. We can ask "what is the impact of Wikipedia on the research community?" For example, Wikipedia is the 8th largest referrer of DOIs, which means that Wikipedia is a major source of traffic to academic publishing sites. All those Wikipedia pages which cite the primary literature are driving traffic to those articles.

Conversely, if we regard Wikipedia as important, we can use citations of articles in Wikipedia pages as a measure of a researcher's impact. For example, according to Impactstory I am "Wikitastic" because 11 Wikipedia pages cite articles that I have authored (authorship is discovered using my ORCID 0000-0002-7101-9767).

Likewise, Altmetric tracks citations on Wikipedia, so a paper like the one below may have minimal social media impact but still has the gray donut ring signifying that it's been cited on Wikipedia.

JENKINS, P. D., & ROBINSON, M. F. (2002, June). Another variation on the gymnure theme: description of a new species of Hylomys (Lipotyphla, Erinaceidae, Galericinae). Bulletin of The Natural History Museum. Zoology Series. Cambridge University Press (CUP) doi:10.1017/S0968047002000018

Hence, we can look at Wikipedia in two different ways. The first is to ask "what is the impact of Wikipedia?", the second is to assume that Wikipedia has impact, and then use that as one measure of the impact of researchers (how "Wikitastic" you are).

So, let's go back to GBIF. Imagine we leave aside the question of whether GBIF has impact and imagine that we can use GBIF as a measure of impact ("GBIFtastic", sorry, that was unforgivable).

Example 1: From DOI to FundRef to GBIF

In a previous post I discussed the lack of mosquito data in GBIF and how I plugged this gap using open data cited by a paper in eLife. This paper has the DOI 10.7554/elife.08347, and if I plug that into CrossRef's search engine I can get back some information on the funders of that paper:

Research funded by Sir Richard Southwood Graduate Scholarship | Rhodes Scholarships | National Institutes of Health (RAPIDD program, R01-AI069341, R01-AI091980, R01-GM08322, N01-A1-25489) | Wellcome Trust (#095066, Vecnet, #099872) | National Aeronautics and Space Administration (#NNX15AF36G) | Biotechnology and Biological Sciences Research Council | Bill and Melinda Gates Foundation (#OPP1053338, #OPP52250) | Studienstiftung des Deutschen Volkes | Directorate-General for Research and Innovation (#21803) | European Centre for Disease Prevention and Control (ECDC/09/018)

Now, this gives me a connection between funding agencies, a paper they funded, and the data in GBIF. For example, the Bill and Melinda Gates Foundation (doi:10.13039/100000865) funded doi:10.7554/elife.08347 which generated data in GBIF doi:10.15468/7apj8n.
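This lookup can be scripted: CrossRef's REST API returns the funder names and FundRef DOIs as part of a work's metadata. A minimal sketch:

import requests

def funders(doi):
    # CrossRef's works route includes a "funder" list with FundRef DOIs.
    r = requests.get("https://api.crossref.org/works/" + doi)
    work = r.json()["message"]
    return [(f.get("name"), f.get("DOI")) for f in work.get("funder", [])]

for name, funder_doi in funders("10.7554/elife.08347"):
    print(name, funder_doi)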

I suspect that the Bill and Melinda Gates Foundation don't know that they've funded data gathering that has ended up in GBIF, but I imagine they'd be interested, especially if that could be quantified (even better if we can demonstrate reuse). The process of linking funders to data can be largely automated, especially as more and more papers are now automatically linked to funder information. The link between publications and data in GBIF can be harder to establish, but at least one publisher (Pensoft) has established a direct feed from publication to GBIF.

So, what if GBIF could computationally discover the funders of the data it holds, and then communicate that to those funders? I think there's scope here for funders to take an interest in GBIF and its role in expanding the reuse (and hence impact) of data that funders have paid for. Demonstrating to governments that national funding agencies are supporting research that generates data that ends up in GBIF may help make the case that GBIF is worth supporting.

Example 2: GBIF as altmetric source

The little Altmetric donuts that we see on papers require sources of data, such as Twitter, Wikipedia, blogs, etc. For example, the Plant List dataset I recently put into GBIF has a DOI (doi:10.15468/btkum2) and this has received some attention, so it has an Altmetric donut (wouldn't it be nice if GBIF showed these on dataset pages?).

What if GBIF itself became a source that Altmetric scanned when measuring impact? What if having your papers mentioned in GBIF (for example, as a source of distributional data or a taxonomic name) contributed to the visible impact of that work? Wouldn't that encourage people to mobilise their data? Wouldn't that help people discover the wider conversation about the data and associated publications? Wouldn't that help generate more impact for papers that might otherwise gather less attention?

Summary

I realise that I've somewhat avoided the question of the impact of GBIF itself, which is something that also needs to be tackled (and this is one reason why GBIF assigns DOIs to datasets and downloads to support data citation), but I think that may be only a part of the bigger picture. If we assume GBIF is impactful to start with, then I think we can start to think how GBIF can help persuade researchers and funders that contributing to GBIF is a good thing.

Wednesday, January 13, 2016

Surfacing the deep data of taxonomy

My paper "Surfacing the deep data of taxonomy" (based on a presentation I gave in 2011) has appeared in print as part to a special issue of Zookeys:

Page, R. (2016, January 7). Surfacing the deep data of taxonomy. ZooKeys. Pensoft Publishers. http://doi.org/10.3897/zookeys.550.9293
The manuscript was written shortly after the talk, but as is the nature of edited volumes it's taken a while to appear.

My tweet about the paper sparked some interesting comments from David Shorthouse.

This is an appealing vision, because it seems unlikely that having multiple, small communities clustered around taxa will ever have the impact that taxonomists might like. Perhaps if we switch to focussing on objects (sequences, specimens, papers), notions of identity (e.g., DOIs, ORCID), and alternative measures of impact, we can increase the visibility and perceived importance of the field. In this context, the recent paper "Wikiometrics: A Wikipedia Based Ranking System" http://arxiv.org/abs/1601.01058 looks interesting. A big consideration will be how connected the network linking taxonomists, papers, sequences, specimens, and names actually is. If it's anything like the network of readers in Mendeley then we may face some challenges in community building around such a network.

Monday, September 07, 2015

Wikidata, Wikipedia, and #wikisci

Last week I attended the Wikipedia Science Conference (hashtag: #wikisci) at the Wellcome Trust in London. It was an interesting two days of talks and discussion. Below are a few random notes on topics that caught my eye.

What is Wikidata?

A recurring theme was the emergence of Wikidata, although it never really seemed clear what role Wikidata saw for itself. On the one hand, it seems to have a clear purpose:
Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wikisource, and others.
At other times there was a sense that Wikidata wanted to take any and all data, which it doesn't really seem geared up to do. The English language Wikipedia has nearly 5 million articles, but there are lots of scientific databases that dwarf that in size (we have at least that many taxonomic names, for example). So, when Dario Taraborelli suggests building a repository of all citations with Wikidata, does he really mean ALL citations in the academic literature? CrossRef alone has 75M DOIs, whereas Wikidata currently has 14.8M pages, so we are talking about greatly expanding the size of Wikidata with just one type of data.

The sense I get is that Wikidata will have an important role in (a) structuring data in Wikipedia, and (b) providing tools for people to map their data to the equivalent topics in Wikipedia. Both are very useful goals. What I find less obvious is whether (and if so, how) Wikidata aims to be a more global database of facts.

How do you Snapchat? You just Snapchat

As a relative outsider to the Wikipedia community, and having had a sometimes troubled experience with Wikipedia, it struck me how opaque things are if you are an outsider. I suspect this is true of most communities: if you are a member then things seem obvious; if you're not, it takes time to find out how things are done. Wikipedia is a community with nobody in charge, which is a strength, but can also be frustrating. The answer to pretty much any question about how to add data to Wikidata, how to add data types, etc. was "ask the community". I'm reminded of the American complaint about the European Union: "if you pick up the phone to call Europe, who do you call?". In order to engage you have to invest time in discovering the relevant part of the community, and then learn to engage with it. This can be time consuming, and differs both from having to satisfy the requirements of gatekeepers and from a decentralised approach where you can simply upload whatever you want.

Streams

It seems that everything is becoming a stream. Once the volume of activity reaches a certain point, people don't talk about downloading static datasets, but instead about consuming a stream of data (very like the Twitter firehose). The volume of Wikipedia edits means scientists studying the growth of Wikipedia are now consuming streams. Geoffrey Bilder of CrossRef showed some interesting visualisations of real-time streams of DOIs being cited as users edited Wikipedia pages (CrossRef DOI Events for Wikimedia), and Peter Murray-Rust of ContentMine seemed to imply that ContentMine is going to generate streams of facts (rather than, say, a queryable database of facts). Once we get to the stage of having large, transient volumes of data, all sorts of issues about reanalysis and reproducibility arise.

CrossRef and evidence

One of the other striking visualisations CrossRef have is the DOI Chronograph, which displays the number of CrossRef DOI resolutions by the domain of the hosting web site. In other words, if you are on a Wikipedia page and click on a DOI for an article, that's recorded as a DOI resolution from the domain "wikipedia.org". For the period 1 October 2010 to 1 May 2015 Wikipedia was the source of 6.8 million clicks on DOIs, see http://chronograph.labs.crossref.org/domains/wikipedia.org. One way to interpret this is as a measure of how many people are accessing the primary literature (the underlying evidence) for assertions made on Wikipedia pages. We can compare this with results for, say, biodiversity informatics projects. For example, EOL has 585(!) DOI clicks for the period 15 October 2010 to 30 April 2015. There are all sorts of reasons for the difference between these two sites, such as the fact that Wikipedia has vastly more traffic than EOL. But I think it also reflects the fact that many Wikipedia articles are richly referenced with citations to the primary literature, whereas projects like EOL are very poorly linked to that literature. Indeed, most biodiversity databases are divorced from the evidence behind the data they display.

Diversity and a revolution led by greybeards

"Diversity" is one of those words that has become politicised, and attempts to promote "diversity" can get awkward ("let's hear from the women", that homogeneous category of non-men). But the aspect of diversity that struck me was age-related. In discussions that involved fellow academics, invariably they looked a lot like me - old(ish), relatively well-established and secure in their jobs (or post-job). This is a revolution led not by the young, but by the greybeards. That's a worry. Perhaps it's a reflection of the pressures on young or early-stage scientists to get papers into high-impact factor journals, get grants, and generally play the existing game, whereas exploring new modes of publishing, output, impact, and engagement have real risks and few tangible rewards if you haven't yet established yourself in academia.

Monday, August 10, 2015

Possible project: mapping authors to Wikipedia entries using lists of published works

One of the less glamorous but necessary tasks of data cleaning is mapping "strings to things", that is, taking strings such as "George A. Boulenger" and mapping them to identifiers, such as ISNI 0000 0001 0888 841X. In the case of authors such as George Boulenger, one way to do this would be through Wikipedia, which has entries for many scientists, often linked to identifiers for those people (see the bottom of the Wikipedia page for George A. Boulenger and look at the "Authority control" section).

How could we make these mappings? Simple string matching is one approach, but it seems to me that a more robust approach could use bibliographic data. For example, if I search for George A. Boulenger in BioStor, I get lots of publications. If at least some of these were listed on the Wikipedia page for this person, together with links back to BioStor (or some other external identifier, such as DOIs), then we could do the following:

  1. Search Wikipedia for names that match the author name of interest
  2. If one or more matches are found, grab the text of the Wikipedia pages, extract any literature cited (e.g., in {{cite}} templates), get the bibliographic identifiers, and see if they match any in our search results.
  3. If we get one or more hits, then it's likely that the Wikipedia page is about the author of the papers we are looking at, and so we link to it.
  4. Once we have a link to Wikipedia, extract any external identifier for that person, such as ISNI or ORCID.
For this to work, it requires that the Wikipedia page cites works by the author in a way that we can harvest, and uses identifiers that we can match to those in the relevant database (e.g., BioStor, CrossRef, etc.). We might also have to look at Wikipedia pages in multiple languages, given that English-language Wikipedia may be lacking information on scholars from non-English speaking countries (this will be a significant issue for many early taxonomists).
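Here is a rough sketch of steps 1 and 2, using the MediaWiki search API and a crude DOI regex; the known_dois set stands in for identifiers retrieved from BioStor or CrossRef:

import re
import requests

API = "https://en.wikipedia.org/w/api.php"

def search_pages(name, limit=5):
    # Candidate Wikipedia pages matching the author's name.
    r = requests.get(API, params={"action": "query", "list": "search",
                                  "srsearch": name, "srlimit": limit,
                                  "format": "json"})
    return [hit["title"] for hit in r.json()["query"]["search"]]

def dois_in_page(title):
    # Pull the page's wikitext and extract anything that looks like a DOI.
    r = requests.get(API, params={"action": "query", "prop": "revisions",
                                  "rvprop": "content", "rvslots": "main",
                                  "titles": title, "format": "json",
                                  "formatversion": 2})
    wikitext = r.json()["query"]["pages"][0]["revisions"][0]["slots"]["main"]["content"]
    return set(re.findall(r"10\.\d{4,9}/[^\s|}<]+", wikitext))

known_dois = {"10.1371/currents.RRN1228"}  # illustrative; from the author's search results
for title in search_pages("George A. Boulenger"):
    if dois_in_page(title) & known_dois:
        print("Probable Wikipedia page for this author:", title)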

Based on my limited browsing of Wikipedia, there seems to be little standardisation of entries for people, and certainly little consistency in how their published works are listed (section headings, format, how many, etc.). The project I'm proposing would benefit from a consistent set of guidelines for how to include a scholar's output.

What makes this project potentially useful is that it could help flesh out Wikipedia pages by encouraging people to add lists of published works, it could aid bibliographic repositories like my own BioStor by increasing the number of links they get from Wikipedia, and if the Wikipedia page includes external identifiers then it helps us go from strings to things by giving us a way to locate globally unique identifiers for people.

Thursday, August 02, 2012

Google Knowledge Graph using data from BBC and Wikipedia

Google's Knowledge Graph can enhance search results by displaying some structured information about a hit in your list of results. It's available in the US (i.e., you need to use www.google.com), although I have seen it occasionally appear for google.co.uk.

[Screenshot: Google search result for Eidolon helvum with Knowledge Graph panel]
Here is what Google displays for Eidolon helvum (the straw-coloured fruit bat). You get a snippet of text from Wikipedia, and also a map from the BBC Nature Wildlife site. Wikipedia is a well-known source of structured data (in that you can mine the infoboxes for information). The BBC site has some embedded RDFa and structured HTML, and you can also get RDF (just append ".rdf" to the URL, i.e., http://www.bbc.co.uk/nature/life/Straw-coloured_Fruit_Bat.rdf). There doesn't seem to be anything in the RDF about the distribution map, so presumably Google are extracting that information from the HTML.

It would be interesting to think about what other biodiversity data providers, such as GBIF and EOL could do to get their data incorporated into Google's Knowledge Graph, and eventually into these search result snippets.

Saturday, June 02, 2012

Linking NCBI taxonomy to GBIF


In response to Rutger Vos's question I've started to add GBIF taxon ids to the iPhylo Linkout website. If you've not come across iPhylo Linkout, it's a Semantic Mediawiki-based site where I maintain links between the NCBI taxonomy and other resources, such as Wikipedia and the BBC Nature Wildlife finder. For more background see:

Page, R. D. M. (2011). Linking NCBI to Wikipedia: a wiki-based approach. PLoS Currents, 3, RRN1228. doi:10.1371/currents.RRN1228

I'm now starting to add GBIF ids to this site. This is potentially fraught with difficulties. There's no guarantee that the GBIF taxonomy ids are stable, unlike NCBI tax_ids which are fairly persistent (NCBI publish deletion/merge lists when they make changes). Then there are the obvious problems with the GBIF taxonomy itself. But, if you want a way to generate a distribution map for a taxon in the NCBI taxonomy, the quickest way is going to be via GBIF.

The mapping is being made automatically, with some crude checks to try and avoid too many erroneous links (e.g., due to homonyms). It will probably take a few days to complete (the mapping is quick, uploading to the wiki is a bit slower). Using a wiki to manage the mapping makes it easy to correct any spurious matches.

As an example, the page http://iphylo.org/linkout/Ncbi:109175 is for the frog Hyla japonica (NCBI tax_id 109175) and shows links to Wikipedia (http://en.wikipedia.org/wiki/Japanese_Tree_Frog) and to GBIF (http://data.gbif.org/species/2427601/). There's even a link to TreeBASE. I display a GBIF map so you can see what data GBIF currently has for that taxon.

[Screenshot: iPhylo Linkout page for Hyla japonica with GBIF map]

So, we have a wiki page, how do we answer Rutger's original question: how to get GBIF occurrence records via web service?

To do this we can use the RDF output by the Semantic Mediawiki software that underpins the wiki. You can get this by clicking on the RDF icon near the bottom of the page, or go to http://iphylo.org/linkout/Special:ExportRDF/Ncbi:109175. The RDF this produces is really, really ugly (and people wonder why the Semantic Web has been slow to take off...). In this RDF you will see the statement:

<rdfs:seeAlso rdf:resource="http://data.gbif.org/species/2427601/"/>

So, arm yourself with XPath, a regular expression, or if you are a serious RDF geek break out the SPARQL, and you can extract the GBIF taxon id for a NCBI taxon. Given that id you can query the GBIF web services. One service that I like is the occurrence density service, which you can use to recreate the 1°×1° density maps shown by GBIF. For example, http://data.gbif.org/ws/rest/density/list?taxonconceptkey=2427601 will get you the squares shown in the screen shot above.
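Putting the pieces together, here is a sketch of the workflow just described. It uses the endpoints as they were at the time (note that data.gbif.org and its density service have since been retired):

import re
import urllib.request

def gbif_id_for_ncbi(tax_id):
    # Fetch the Semantic Mediawiki RDF export and pull out the GBIF species link.
    url = "http://iphylo.org/linkout/Special:ExportRDF/Ncbi:%s" % tax_id
    rdf = urllib.request.urlopen(url).read().decode("utf-8")
    m = re.search(r'rdfs:seeAlso rdf:resource="http://data\.gbif\.org/species/(\d+)', rdf)
    return m.group(1) if m else None

gbif_id = gbif_id_for_ncbi(109175)  # Hyla japonica
if gbif_id:
    # The (now retired) occurrence density service.
    print("http://data.gbif.org/ws/rest/density/list?taxonconceptkey=%s" % gbif_id)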

Of course, I have glossed over several issues, such as the errors and redundancy in the GBIF classification, the mismatch between NCBI and GBIF classifications (NCBI has many more ranks than GBIF), and whether the taxon concepts used by the two databases are equivalent (this is likely to be more of an issue for higher taxa). But it's a start.

Thursday, March 31, 2011

Paper on NCBI and Wikipedia published in PLoS Currents: Tree of Life

My paper describing the mapping between NCBI and Wikipedia has been published in PLoS Currents: Tree of Life. You can see the paper here. It's only just gone live, so it's yet to get a PubMed Central number (one of the nice features of PLoS Currents is that the articles get archived in PMC).

Publishing in PLoS Currents: Tree of Life was a pleasant experience. The Google Knol editing environment was easy to use, and the reviewing process quick. It's obviously a new and rather experimental journal, and there are a few things that could be improved. Automatically looking up articles by PubMed identifier is nice, but it would be great to do this for DOIs as well. Furthermore, the PubMed identifiers aren't displayed as clickable links, which rather defeats the point of having references on the web (I've added DOI links to the articles wherever possible). But, minor grumbles aside, as a way to get an Open Access article published for free, and have it archived in PubMed Central, PLoS Currents is hard to beat. What will be interesting is whether the article receives any comments. This seems to be one area online journals haven't really cracked: providing an environment where people want to engage in discussion.

Thursday, March 24, 2011

TreeBASE meets NCBI, again

Déjà vu is a scary thing. Four years ago I released a mapping between names in TreeBASE and other databases called TBMap (described here: doi:10.1186/1471-2105-8-158). Today I find myself releasing yet another mapping, as part of my NCBI to Wikipedia project. By embedding the mapping in a wiki it can be edited, so the kinds of problems I encountered with TBMap (recounted here, here, and here) can be fixed as they are found. The mapping in and of itself isn't terribly exciting, but it's the starting point for some things I want to do regarding how to visualise the data in TreeBASE.

Because TreeBASE 2 has issued new identifiers for its taxa (see TreeBASE II makes me pull my hair out), and now contains its own mapping to the NCBI taxonomy, as a first pass I've taken their mapping and added it to http://iphylo.org/linkout. I've also added some obvious mappings that TreeBASE has missed. There are a lot more taxa which could be added, but this is a start.

The TreeBASE taxa that have a mapping each get their own page with a URL of the form http://iphylo.org/linkout/<TreeBase taxon identifier>, e.g. http://iphylo.org/linkout/TB2:Tl257333. This page simply gives the name of the taxon in TreeBASE and the corresponding NCBI taxon id. It uses a Semantic Mediawiki template to generate a statement that the TreeBASE and NCBI taxa are a "close match". If you go to the corresponding page in the wiki for the NCBI taxon (e.g., http://iphylo.org/linkout/Ncbi:448631) you will see any corresponding TreeBASE taxa listed there. If a mapping is erroneous, we simply need to edit the TreeBASE taxon page in the wiki to fix it. Nice and simple.

At the time of writing the initial mapping is still being loaded (this can take a while). I'll update this post when the uploading has finished.

Tuesday, March 01, 2011

Zooming a large tree, now with thumbnails

Continuing experiments with a zoom viewer for large trees (see previous post), I've now made a demo where the labels are clickable. If the NCBI taxon has an equivalent page in Wikipedia the demo displays a link to that page (and, if present, a thumbnail image). Give it a try at

http://iphylo.org/~rpage/deeptree/3.html

or watch the short video clip below:

Wednesday, December 15, 2010

TreeBASE, again

My views on TreeBASE are pretty well known. Lately I've been thinking a lot about how to "fix" TreeBASE, or indeed, move beyond it. I've made a couple of baby steps in this direction.

The first step is that I've created a group for TreeBASE papers on Mendeley. I've uploaded all the studies in TreeBASE as of December 13, 2010. Having these in Mendeley makes it easier to tidy up the bibliographic metadata, add missing identifiers (such as DOIs and PubMed ids), and correct citations to non-existent papers (which can occur if, at the time the authors uploaded their data, they planned to submit their paper to one journal, but it ended up being accepted in another). If you've a Mendeley account, feel free to join the group. If you've contributed to TreeBASE, you should find your papers already there.

The second step is playing with CouchDB (this year's new hotness), exploring ways to build a database of phylogenies that has nothing much to do with either a relational database or a triple store. CouchDB is a document store, and I'm playing with taking NeXML files from TreeBASE, converting them to something vaguely usable (i.e., JSON), and adding them to CouchDB. For fun, I'm using my NCBI to Wikipedia mapping to get images for taxa, so if TreeBASE has mapped a taxon to the NCBI taxonomy, and that taxon has a page in Wikipedia with an image, we get an image for that taxon. The reason for this is that I'd really like a phylogeny database that is visually interesting. To give you some examples, here are trees from TreeBASE (displayed using SVG), together with thumbnails of images from Wikipedia:

[Figures: TreeBASE trees rendered as SVG with Wikipedia thumbnails (myzo.png, troidini.png, protea.png, and a screenshot)]


Everything (tree and images) is stored within a single document in CouchDB, making the display pretty trivial to construct. Obviously this isn't a proper interface, and there are things I'd need to do, such as order the images in such a way that they matched the placement of the taxa on the tree, but at a glance you can see what the tree is about. We could then envisage making the images clickable so you could find out more about that taxon (e.g., text from Wikipedia, lists of other trees in the database, etc.).
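To give a flavour of what such a document might look like, here is a sketch that pushes a single tree-plus-thumbnails document into CouchDB over its plain HTTP API. The database name, study id, and image URLs are all made up for illustration:

import requests

COUCH = "http://localhost:5984"
DB = "phylogenies"

requests.put("%s/%s" % (COUCH, DB))  # create the database; harmless if it already exists

doc = {
    "_id": "TB2:S99999",  # hypothetical TreeBASE study id
    "newick": "((Homo,Pan),Gorilla);",
    "taxa": {
        # NCBI genus ids plus Wikipedia-derived thumbnails (URLs illustrative)
        "Homo": {"ncbi": 9605, "thumbnail": "http://example.org/homo.jpg"},
        "Pan": {"ncbi": 9596, "thumbnail": "http://example.org/pan.jpg"},
        "Gorilla": {"ncbi": 9592, "thumbnail": "http://example.org/gorilla.jpg"},
    },
}
requests.put("%s/%s/%s" % (COUCH, DB, doc["_id"]), json=doc)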

We could expand this further by extracting geographical information (say, from the sequences included in the study) and making a map, or eventually a phylogeny in Google Earth (see David Kidd's recent "Geophylogenies and the Map of Life" for a manifesto, doi:10.1093/sysbio/syq043).

One of the big things missing from databases like TreeBASE is a sense of "fun", or serendipity. It's hard to find stuff, hard to discover new things, make new connections, or put things in context. And that's tragic. Try a Google image search for treebase+phylogeny:

[Screenshot: Google image search results for treebase+phylogeny]

Call me crazy, but I looked at that and thought "Wow! This phylogeny stuff is cool!" Wouldn't it be great if that's the reaction people had when they looked at a database of evolutionary trees?

Friday, July 09, 2010

Wikipedia paper out

My short note on "Wikipedia as an Encyclopaedia of Life" has appeared in Organisms Diversity & Evolution (doi:10.1007/s13127-010-0028-9) (yes, I do occasionally write papers). A preprint of this paper is available on Nature Precedings (hdl:10101/npre.2010.4242.1).

My presentation at iEvoBio covers much the same ground, and is included below, although the paper was written before I made the mapping from NCBI taxa to Wikipedia pages.


Tuesday, June 22, 2010

Mashing up NCBI and Wikipedia using treemaps

Having made a first stab at mapping NCBI taxa to Wikipedia, I thought it might be fun to see what could be done with it. I've always wanted to get quantum treemaps working (quantum treemaps ensure that the cells in the treemap are all the same size; see my 2006[!] blog post for further description and links). After some fussing I have some code that seems to do the trick. As an example, here is a quantum treemap for Laurasiatheria.

[Quantum treemap for Laurasiatheria]
The diagram shows the NCBI taxonomy subtree rooted on Laurasiatheria, with images (where available) from Wikipedia for the children of the children of that node. In other words, the images correspond to the tips of the tree below:

[NCBI taxonomy subtree rooted on Laurasiatheria]

There's a lot to be done to tidy this up, but there is potential to create a nice, visual way to navigate through the NCBI taxonomy (it might work well on the iPhone or iPad, for example).

Thursday, June 17, 2010

NCBI to Wikipedia links are now live...

The 52,956 links from NCBI to Wikipedia that I've been busy creating are now "live." If you go to a NCBI taxon such as Sphaerius you'll see something like this:

[Screenshot: NCBI taxonomy page for Sphaerius showing the Wikipedia LinkOut link]

Clicking the "Wikipedia" link takes you to the Wikipedia page for this taxon. You can see all the links to Wikipedia using the query loproviphylo[filter]. Here are some additional links to try:

NCBI    Wikipedia
8353    Xenopus
83698   Banksia
9766    Balaenoptera

Thanks to Scott Federhen and Kathy Kwan at NCBI for all their assistance in getting this into NCBI Linkout.

Fixing errors
There will be errors and omissions. The best way to fix these is by using the iPhylo Linkout wiki. The page for a NCBI taxon is always http://iphylo.org/linkout/Ncbi:xxxx where xxxx is the NCBI taxonomy id. You can edit/annotate the link there (click on "edit with form" for a simple web form). I plan to update the links regularly based on the wiki.

Future
NCBI Linkout provides access statistics, so it will be interesting to see how much traffic goes from NCBI to Wikipedia. It will also be interesting to see if this is correlated with increased editing of those Wikipedia pages.

Tuesday, June 08, 2010

Linking NCBI to Wikipedia

In an earlier post I discussed linking the NCBI taxonomy to Wikipedia. One way to tackle this is to add NCBI Taxonomy IDs to Wikipedia pages. I reopened the case for adding the Taxonomy IDs to the Taxobox on each taxon page, but this met with substantial resistance. A modified proposal to add them elsewhere on the Wikipedia page seems to be gaining more support (or, at least, less vigorous resistance).

Meanwhile, there are other things that need to be done to link NCBI and Wikipedia. One is to add Wikipedia page names to NCBI Linkout so that when viewing a NCBI taxon page you will see a link to Wikipedia if a page for the corresponding taxon exists. To create this linkout we need a mapping from NCBI to Wikipedia, and that's what I've been working on for the last few days.

The mapping is still in progress, but essentially I've taken a dump of the NCBI taxonomy for June 3, 2010, and matched the names with those in the June 18, 2009 dump of Wikipedia that I've analysed elsewhere on this blog. I'll detail the various steps in the mapping elsewhere (there are issues such as synonyms, homonyms, Wikipedia redirects, etc.), but for now things seem to be working reasonably well.
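The exact-match step is conceptually simple. The sketch below reads names.dmp from an NCBI taxonomy dump and intersects the scientific names with a set of Wikipedia page titles; handling synonyms, homonyms, and redirects (where the real work lies) is omitted:

def load_ncbi_names(path="names.dmp"):
    # names.dmp fields are pipe-delimited: tax_id | name | unique name | name class |
    names = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = [x.strip() for x in line.split("|")]
            tax_id, name, name_class = fields[0], fields[1], fields[3]
            if name_class == "scientific name":
                names[name] = tax_id
    return names

def exact_matches(ncbi_names, wikipedia_titles):
    # Names shared verbatim between the NCBI taxonomy and Wikipedia page titles.
    return {name: tax_id for name, tax_id in ncbi_names.items()
            if name in wikipedia_titles}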

The mapping is being created in a Semantic Mediawiki at http://iphylo.org/linkout/. When complete you will be able to look up an NCBI taxon by either its name (including synonyms and common names) or its NCBI Taxonomy ID. Where possible I'm mapping the NCBI taxon to Wikipedia, and providing a snippet of text and an image.

I've also extracted bibliographic information from the citations.dmp file that comes with the NCBI dump. This contains the comments that you sometimes see on a taxon page. In a few cases I've added some information manually. For example, the beetle genus Sphaerius has a rather complicated nomenclatural history, which the NCBI page summarises as:
Due to a recent ruling (ICZN 2000), the family and generic names Sphaeriusidae Erichson, 1845, and Sphaerius Waltl, 1838, are both available names and have priority over Microsporidae Crotch, 1873 for the family name and Microsporus Kolenati, 1846 for the single included genus, respectively.

By looking through BioStor I've found some of the papers relating to this ICZN ruling, and added them to the wiki page http://iphylo.org/linkout/Ncbi:174920 (aficionados of zoological nomenclature may enjoy the complexity of the case, due to homonymy between the corresponding family name, Sphaeriidae, and a mollusc family of the same name).

Once this mapping is complete, it will be time to think about how to get it into NCBI's Linkout, and also how to automatically update the mapping to reflect the growth of both the NCBI taxonomy and Wikipedia. If you visit http://iphylo.org/linkout/ please be aware that the mapping is still being written to the wiki (this is being done via API calls, and adding some 900,000 pages is going to take a while).

Thursday, May 20, 2010

NCBI Taxonomy IDs and Wikipedia


I've written a note on the Wikipedia Taxobox page making the case for adding NCBI taxonomy IDs to the standard Taxobox used to summarise information about a taxon. Here is what I wrote:

Wikipedia's taxon pages have a huge web presence (see my blog post Google and Wikipedia revisited and Page, R. D. M. (2010). "Wikipedia as an encyclopaedia of life". Nature Precedings hdl:10101/npre.2010.4242.1). If a taxon is in Wikipedia it is almost always the first search result in Google. Researchers in other areas of biology are making use of Wikipedia as a tool to annotate genes (Gene Wiki) and RNA families (Wikipedia:WikiProject_RNA). Pages for genes, such as Cytochrome_b, have numerous external identifiers in their equivalent of the Taxobox (the Pfam_box). I think we are missing a huge opportunity by not including NCBI taxonomy ids. The advantages would be:

  • It would provide a valuable service to Wikipedia readers by enabling them to go to NCBI to discover more about a taxon

  • It would help Wikipedia contributors by providing a standardised way to refer to NCBI (and enable bots to add missing NCBI taxonomy ids). Putting them in an External links section makes it harder to be consistent (there are various ways to write a URL linking to the NCBI taxonomy)

  • It would facilitate linking from NCBI to Wikipedia. A mapping of Wikipedia pages to NCBI taxonomy ids could be added to NCBI Linkout, generating more traffic to the Wikipedia pages

  • Projects that are trying to integrate information from different sources would be able to combine information of genomics from NCBI with other information much more readily

Note that I am not arguing that Wikipedia should "follow" NCBI taxonomy, merely that where the potential to link exists, the links would create value, both within and outside the Wikipedia community.

Some discussion has ensued on the Taxobox page, all positive. I'm blogging this here to encourage anyone who has any more thoughts on the matter to contribute to the discussion.

Monday, March 15, 2010

How Wikipedia can help scope a project

I'm revisiting the idea of building a wiki of phylogenies using Semantic Mediawiki. One problem with a project like this is that it can rapidly explode. Phylogenies have taxa, which have characters, nucleotides sequences and other genomics data, and names, and come from geographic locations, and are collected and described by people, who may deposit samples in museums, and also write papers, which are published in journals, and so on. Pretty soon, any decent model of a phylogeny database is connected to pretty much anything of interest in the biological sciences. So we have a problem of scope. At what point do we stop adding things to the database model?

It seems to me that Wikipedia can help. Once we hit a topic that exists in Wikipedia, then we can stop. It's a reasonable bet that either now, or at some point in the future, the Wikipedia page is likely to be as good as, or better than, anything a single project could do. Hence, there's probably not much point storing lots of information about genes, countries, geographic regions, people, journals, or even taxa, as Wikipedia has these. This means we can focus on gluing together the core bits of a phylogenetic study (trees, taxa, data, specimens, publications) and then link these to Wikipedia.

In a sense this is a variation on the ideas explored in EOL, the BBC, and Wikipedia, but in developing my wiki of phylogenies project (this is the third iteration of this project) it's struck me how the question "is this in Wikipedia?" is the quickest way to answer the question "should I add x to my wiki?" Hence, Wikipedia becomes an antidote to feature bloat, and helps define the scope of a project more clearly.

Wednesday, March 03, 2010

Wikipedia manuscript

I've written up some thoughts on Wikipedia for a short invited review to appear (pending review) in Organisms Diversity & Evolution (ISSN 1439-6092). The manuscript, entitled "Wikipedia as an encyclopaedia of life", is available as a preprint from Nature Precedings (hdl:10101/npre.2010.4242.1). The opening paragraph is:
In his 2003 essay E O Wilson outlined his vision for an "encyclopaedia of life" comprising "an electronic page for each species of organism on Earth", each page containing "the scientific name of the species, a pictorial or genomic presentation of the primary type specimen on which its name is based, and a summary of its diagnostic traits." Although the "quiet revolution" in biodiversity informatics has generated numerous online resources, including some directly inspired by Wilson's essay (e.g., http://ispecies.org, http://www.eol.org), we are still some way from the goal of having available online all relevant information about a species, such as its taxonomy, evolutionary history, genomics, morphology, ecology, and behaviour. While the biodiversity community has been developing a plethora of databases, some with overlapping goals and duplicated content, Wikipedia has been slowly growing to the point where it now has over 100,000 pages on biological taxa. My goal in this essay is to explore the idea that, largely independent of the efforts of biodiversity informatics and well-funded international efforts, Wikipedia (http://en.wikipedia.org/wiki/Main_Page) has emerged as potentially the best platform for fulfilling E O Wilson's vision.

The content will be familiar to readers of this blog, although the essay is perhaps a slightly more sober assessment of Wikipedia than some of my blog posts would suggest. It was also the first manuscript I'd written in MS Word for a while (not a fun experience), and the first ever for which I'd used Zotero to manage the bibliography (which worked surprisingly well).

Tuesday, February 02, 2010

EOL, the BBC, and Wikipedia

Last month EOL took the brave step of including Wikipedia content in its pages. I say "brave" because early on EOL was pretty reluctant to embrace Wikipedia on this scale (see the report of the Informatics Advisory Group that I chaired back in 2008), and also because not all of EOL's curators have been thrilled with this development. Partly to assuage their fears, EOL displays Wikipedia-derived content on a yellow background to flag its "unreviewed" status, such as this image of the python genus Leiopython:

[Screenshot: EOL page for Leiopython, with Wikipedia-derived content on a yellow background]


It's interesting to compare EOL's approach to Wikipedia with that taken by the BBC, as documented in Case Study: Use of Semantic Web Technologies on the BBC Web Sites. The BBC makes extensive use of content from community-driven external sites such as MusicBrainz and Wikipedia. They embed the content in their own pages, stating where the content came from, but not flagging it as any less meaningful or reliable than the BBC's own content (i.e., no garish yellow background).

Furthermore, the BBC does two clever things. Firstly:
To facilitate integration with the resources external to bbc.co.uk the music site reuses MusicBrainz URL slugs and Wildlife Finder Wikipedia URL slugs. This means that it is relatively straight forward to find equivalent concepts on Wikipedia/DBpedia and Wildlife Finder and, MusicBrainz and /music.


This means that if the identifier for the artist Bat for Lashes in Musicbrainz is http://musicbrainz.org/artist/10000730-525f-4ed5-aaa8-92888f060f5f.html, the BBC reuse the "slug" 10000730-525f-4ed5-aaa8-92888f060f5f and create a page at http://www.bbc.co.uk/music/artists/10000730-525f-4ed5-aaa8-92888f060f5f. Likewise, if the Wikipedia page for Varanus komodoensis is http://en.wikipedia.org/wiki/Komodo_dragon, then the BBC Wildlife Finder page becomes http://www.bbc.co.uk/nature/species/Komodo_dragon, reusing the slug Komodo_dragon.



Reusing identifiers like this can greatly facilitate linking between databases. I don't need to do a search, or approximate string matching, I just reuse the slug. Note that this is a two-way thing: it is trivial for Musicbrainz to create links to BBC information, and vice versa. Reusing identifiers isn't new; other examples include Amazon.com's ASIN (which for books are ISBNs), and BHL reuses uBio NameBankIDs -- want literature that mentions the Komodo dragon? Use the uBio NameBankID 2546401 in a BHL URL http://www.biodiversitylibrary.org/name/2546401.

The second clever thing the BBC does is treat the web as a content management system:

BBC Music is underpinned by the Musicbrainz music database and Wikipedia, thereby linking out into the Web as well as improving links within the BBC site. BBC Music takes the approach that the Web itself is its content management system. Our editors directly contribute to Musicbrainz and Wikipedia, and BBC Music will show an aggregated view of this information, put in a BBC context.


Instead of separating BBC and Wikipedia content (and putting the latter in quarantine as EOL does), the BBC embraces Wikipedia, editing Wikipedia content if they feel a page needs improving. One advantage of this approach is that it avoids the need for the BBC to replicate Wikipedia, either in terms of content (the BBC doesn't need to write its own descriptions of what an organism does) or services (the BBC doesn't need to develop tools for people to edit BBC pages; people use Wikipedia's infrastructure for this). Wikipedia provides core text and identifiers, the BBC provides its own unique content and branding.

EOL is trying something different, and perhaps more challenging (at least to do it properly). Given that both EOL and Wikipedia offer text about organisms, there is likely to be overlap (and possibly conflict) between what EOL and Wikipedia say about the same taxon. Furthermore, there will be duplication of information such as bibliographic references. For example, the Wikipedia content included in the EOL page for Leiopython contains a bibliography, which includes these references:

Hubrecht AAW. 1879. Notes III on a new genus and species of Pythonidae from Salawatti. Notes from the Leyden Museum 14-15.

Boulenger GA. 1898. An account of the reptiles and batrachians collected by Dr. L. Loria in British New Guinea. Annali del Museo Civico de Storia Naturale di Genova (2) 18:694-710

The genus name Leiopython was published by Hubrecht (1879), and Boulenger (1898) is cited in support of a claim that a distribution record is erroneous. Hence, these look like useful papers to read. Neither reference on the Wikipedia page is linked to an online version of the article, but both have been scanned by EOL's partner BHL (you can see the articles in BioStor here, and here, respectively) [1].

Problem is, you'd be hard pressed to discover this from the EOL page. The BHL results do list the journal Notes from the Leyden Museum, but you'd have to visit the links manually to discover whether they include Hubrecht (1879) (they do, as well as various occurrences of Leiopython in the indices for the journal). In part this problem is a consequence of the crude way EOL handles bibliographies retrieved from BHL, but it's symptomatic of a broader problem. By simply mashing EOL and Wikipedia content together, EOL is missing an opportunity to make both itself and Wikipedia more useful. Surely it would be helpful to discover which publications cited on Wikipedia pages are in BHL (or in the list of references for hand-curated EOL pages)? This requires genuine integration (for example by reusing existing bibliographic identifiers such as DOIs, and tools such as OpenURL resolvers). If it fails to do this, EOL will resemble crude pre-Web 2.0 mashups where people created web pages that had content from external sites enclosed in <IFRAME> tags.

The contrast between the approaches adopted by EOL and the BBC is pretty stark. The BBC has devolved text content to external, community-driven sites that it thinks will do a better job than the BBC could alone. EOL is trying to integrate Wikipedia into its own text content, but without addressing the potentially massive duplication (and, indeed, possible contradictions) that are likely to arise. Perhaps it's time for EOL to be as brave as the BBC, and ask itself whether it is sensible to try and occupy the same space as Wikipedia.

[1] Note that the bibliographic details of both papers are wanting: Hubrecht 1879 is in volume 1 of Notes from the Leyden Museum, and Annali del Museo Civico de Storia Naturale di Genova series 2, volume 18 is also treated as volume 38.