iPhylo: May 2016

Roderic D. M. Page

Thursday, May 26, 2016

Thoughts on Wikipedia, Wikidata, and the Biodiversity Heritage Library

Given that Wikipedia, Wikidata, and the Biodiversity Heritage Library (BHL) all share the goal of making information free, open, and accessible, there seems to be a lot of potential for useful collaboration. Below I sketch out some ideas.

BHL as a source of references for Wikipedia

Wikipedia likes to have sources cited to support claims in its articles. BHL has a lot of articles that could be cited by Wikipedia articles. By adding these links, Wikipedia users get access to further details on the topic of interest. BHL also benefits from greater visibility resulting from visits from Wikipedia readers.

In the short term BHL could search Wikipedia for articles that could benefit from links to BHL (see below). In the long term as more and more BHL articles get DOIs this will become redundant as Wikipedia authors will discover articles via CrossRef.

There are various ways to search Wikipedia to get a sense of what links could be added. For example, you can search the Wikipedia API for pages that link to a particular web domain (see https://www.mediawiki.org/wiki/API:Lists/All#Exturlusage). Here's a search for articles linking to biostor.org https://en.wikipedia.org/w/api.php?action=query&list=exturlusage&euquery=biostor.org&eulimit=20.

A quick inspection suggests that many of these links could be improved (for example, some have outdated links to PDFs and not to the article), so we can locate Wikipedia articles that could be edited. It is likely that Wikipedia articles that have one link to BHL or BioStor may have other citations that could be linked.
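
As a rough illustration of the search described above, here is a minimal Python sketch that builds the same `exturlusage` query and groups the hits by Wikipedia page. The function names and the sample response are made up for illustration (the sample only mimics the shape of the API's JSON, with placeholder titles and URLs), and the actual HTTP request is left out so the sketch runs offline.

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def exturlusage_url(domain, limit=20):
    """Build a query URL listing pages that link to the given domain."""
    params = {
        "action": "query",
        "list": "exturlusage",
        "euquery": domain,
        "eulimit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def pages_linking(response_text):
    """Group external URLs by the title of the page that contains them."""
    data = json.loads(response_text)
    pages = {}
    for hit in data.get("query", {}).get("exturlusage", []):
        pages.setdefault(hit["title"], []).append(hit["url"])
    return pages


# Invented sample response (placeholder titles and URLs, not real data).
SAMPLE_RESPONSE = json.dumps({
    "query": {
        "exturlusage": [
            {"pageid": 1, "ns": 0, "title": "Example article A",
             "url": "http://biostor.org/reference/0000"},
            {"pageid": 1, "ns": 0, "title": "Example article A",
             "url": "http://biostor.org/reference/0001.pdf"},
            {"pageid": 2, "ns": 0, "title": "Example article B",
             "url": "http://biostor.org/reference/0002"},
        ]
    }
})

if __name__ == "__main__":
    print(exturlusage_url("biostor.org"))
    for title, urls in pages_linking(SAMPLE_RESPONSE).items():
        print(title, urls)
```

Pages with several hits, or with links ending in `.pdf`, would be natural first candidates for editing.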

Wikipedia as a source of content

One of the big challenges facing BHL is extracting articles from its content. My own BioStor is one approach to tackling this problem. BioStor takes citation details for articles and attempts to locate them in BHL - the limiting factor is access to good-quality citation data. Wikipedia is potentially an untapped source of citation data. Each page that uses the "Cite" template could be mined for citations, which in turn could be used to locate articles. Wikipedia pages using the Cite template can be found via the API, e.g. https://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eititle=Template:Cite&eilimit=20&format=json. Alternatively, we could mine particular types of pages (e.g., those on taxa or taxonomists), or mine Wikispecies (which doesn't use the same citation formatting as Wikipedia).
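
To give a feel for what mining citations might involve, here is a small Python sketch that pulls the fields out of `{{cite journal ...}}` templates in a page's wikitext. It is deliberately naive: a regular expression like this will trip over nested templates and over piped links inside field values, so a proper wikitext parser would be needed for real use. The function name and the sample citation are invented for illustration.

```python
import re

# Matches a simple, non-nested {{cite journal | ... }} template.
CITE_RE = re.compile(r"\{\{\s*cite\s+journal\s*\|(.*?)\}\}",
                     re.IGNORECASE | re.DOTALL)


def extract_citations(wikitext):
    """Return one dict of field -> value per cite journal template found."""
    citations = []
    for match in CITE_RE.finditer(wikitext):
        fields = {}
        for part in match.group(1).split("|"):
            if "=" in part:
                key, value = part.split("=", 1)
                fields[key.strip().lower()] = value.strip()
        citations.append(fields)
    return citations


# Invented wikitext, not taken from a real Wikipedia page.
SAMPLE_WIKITEXT = """
Some text about a frog.<ref>{{cite journal |last=Smith |first=J.
 |title=A new species of frog |journal=Example Journal of Zoology
 |volume=12 |pages=34-56 |year=1901}}</ref>
"""

if __name__ == "__main__":
    for citation in extract_citations(SAMPLE_WIKITEXT):
        print(citation)
```

The journal, volume, and page fields are the sort of citation details that BioStor needs in order to look for the article in BHL.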

Wikidata as a data store

If Wikidata aims to be a repository of all structured data relevant to Wikipedia, then this includes bibliographic citations (see WikiCite 2016), hence many articles in BHL will end up in Wikidata. This has some interesting implications, because Wikidata can model data with more fidelity than many other sources of bibliographic information. For example, it supports multiple languages as well as multiple representations of the same language - the journal Acta Herpetologica Sinica https://www.wikidata.org/wiki/Q24159308 in Wikidata has not only the Chinese title (兩棲爬行動物學報) but also the pinyin transliteration "Liangqi baxing dongwu yanjiu". Rather than attempt to replicate a community-editable database, Wikidata could be the place to manage article and journal-related metadata.

Disambiguating people

As we move from "strings to things" we need to associate names for things with identifiers for those things. I've touched on this already in Possible project: mapping authors to Wikipedia entries using lists of published works. Ideally each author in BHL would be associated with a globally unique identifier, such as ORCID or ISNI. Contributors to Wikipedia and Wikidata have been collecting these for individuals with Wikipedia articles. If those Wikipedia pages have links to BHL content then we can semi-automate the process of linking people to identifiers.

Caveats

There are a couple of potential "gotchas" concerning Wikipedia and BHL. The licenses used for content are different: BHL is typically CC-BY-NC whereas Wikipedia is CC-BY-SA. The "non commercial" restriction used by BHL is a deal-breaker for sharing content such as page images with Wikimedia Commons.

Wikipedia and Wikidata are communities, and I've often found this makes it challenging to find out how to get things done. Who do you contact to make a decision about some new feature you'd like to add? It's not at all obvious (unless you're a part of that community). Existing communities with accepted practices can be resistant to change, or may not be convinced that what you'd like to do is a benefit. For example, I think it would be great to have a Wikipedia page for each journal. Not everyone agrees with this, and one can expend a lot of energy debating the pros and cons. The last time I got seriously engaged with Wikipedia I ended up getting so frustrated I went off in a huff and built my own wiki. This is where a "Wikipedian in residence" might be helpful.

Notes on current and future projects

I'll be taking a break shortly, so I thought I'd try to gather some thoughts on a few projects/ideas that I'm working on. These are essentially extended notes to myself to jog my memory when I return to these topics.

BOLD data into GBIF

Following on from work on getting mosquito data into GBIF I've been looking at DNA barcoding data. BOLD data is mostly absent from GBIF. The publicly available data can be downloaded, and is in a form that could be easily ingested by GBIF. One problem is that the data is incomplete, and sometimes out of date. BOLD's data dumps and BOLD's API use different formats (sigh), and the API returns additional data such as images of voucher specimens. Most data in the data dumps are not identified to species, so they will have limited utility for most GBIF users.

One approach would be to take the data dumps as the basic data, then use the API to enhance that data, such as adding image links. If the API returns a species-level identification for a barcode then that could be added as an identification using the corresponding Darwin Core extension. In this way we could treat the data as an evolving entity, which it is as our knowledge of it improves. For a related example see Leafcutter bee new to science with specimen data on Canadensys where Canadensys record two different identifications of some bee specimens as research showed that some specimens represented a new species.

This work reflects my concern that GBIF is missing a lot of data outside its normal sources. The mechanism for getting data into GBIF is pretty bureaucratic and could do with reforming (or, at least provision of other ways to add data).

BOLD by itself

I've touched on this before (Notes on next steps for the million DNA barcodes map). I'd really like to do something better with the way we display and interact with DNA barcode data. This will need some thought on calculating and visualising massive phylogenies, and spatial queries that return subtrees. I can't help thinking that there's scope for some very cool things in this area. If nothing else, we can do interesting things without getting involved in some of the pain of taxonomic names.

Big trees

Viewing big trees is still something of an obsession. I still think this hasn't been solved in a way that helps us learn about the tree and the entities in that tree. I currently think that a core problem to solve is how to cluster or "fold" a tree in a sensible way to highlight the major groups. I did something rather crude here; other approaches include "Constructing Overview + Detail Dendrogram-Matrix Views" (doi:10.1109/TVCG.2009.130, PDF here).

Graph databases and the biodiversity knowledge graph

I'm working on a project to build a "biodiversity knowledge graph" (see doi:10.3897/rio.2.e8767). In some ways this is recreating my entry in Elsevier's Grand Challenge "Knowledge Enhancement in the Life Sciences" (see also hdl:10101/npre.2009.3173.1 and doi:10.1016/j.websem.2010.03.004).

Currently I'm playing with Neo4J to build the graph from JSON documents stored in CouchDB. Learning Neo4J is taking a little time, especially as I'm creating nodes and edges on the fly and want to avoid creating more than one node for the same thing. In a world of multiple identifiers this gets tricky, but I think there's a reasonable way to do this (see the graph gist Handling multiple identifiers). Since I'm harvesting data I'm ending up building a web crawler, so I need to think about queues, and ways to ensure that data added at different times gets properly linked.
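
To make the "one node per thing" problem concrete, here is a small Python sketch using a union-find (disjoint set) structure to track which identifiers are known to refer to the same thing. This is not the approach in the graph gist, and it is not Neo4J code; it is just one way to decide, before writing to the graph, which existing node a newly harvested identifier should attach to. The class name and the identifiers are placeholders.

```python
class IdentifierIndex:
    """Tracks sets of identifiers that are known to denote the same thing."""

    def __init__(self):
        self._parent = {}

    def _find(self, ident):
        # Register unseen identifiers as their own representative.
        self._parent.setdefault(ident, ident)
        root = ident
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point everything on the path straight at the root.
        while self._parent[ident] != root:
            next_ident = self._parent[ident]
            self._parent[ident] = root
            ident = next_ident
        return root

    def same_thing(self, *idents):
        """Record that all the given identifiers refer to one thing."""
        roots = [self._find(i) for i in idents]
        for root in roots[1:]:
            self._parent[root] = roots[0]

    def node_key(self, ident):
        """Return the representative identifier to use as the node key."""
        return self._find(ident)


if __name__ == "__main__":
    index = IdentifierIndex()
    # Placeholder identifiers; harvested at different times.
    index.same_thing("doi:10.9999/example", "pmid:0000000")
    index.same_thing("pmid:0000000", "handle:example/1")
    print(index.node_key("handle:example/1") == index.node_key("doi:10.9999/example"))
```

Because links can arrive in any order, data added at different times still ends up under a single key once any identifier connecting the two sets is seen.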

Wikipedia and Wikidata

I occasionally play with Wikipedia and Wikidata, although this is often an exercise in frustration as editing Wikipedia tends to result in edit wars ("we don't do things that way"). Communities tend to be conservative. I'll write up some notes about ways Wikipedia and Wikidata can be useful, especially in the context of the Biodiversity Heritage Library (see also Possible project: mapping authors to Wikipedia entries using lists of published works).

All the names

The database of all taxonomic names remains as elusive as ever -- our field should be deeply embarrassed by this, it's just ridiculous.

My own efforts in this area involve (a) obtaining lists of names, by whatever means available, and (b) augmenting them to include links to the primary literature. I've made some of this work publicly accessible (e.g., BioNames). I'd like all name databases to make their data open, but most are resistant to the idea (some aggressively so).

One approach to this is to simply ignore the whimpering and make the data available. Another is to consider recreating the data. We have name finding algorithms, and more of the literature is becoming available, either completely open (e.g., BHL) or accessible to mining (see Content Mine). At some point we will be able to recreate the taxonomic name databases from scratch, making the obstinate data providers no longer relevant.

First descriptions

Names, by themselves, are not terribly useful. But the information that hangs off them is. It occurs to me that projects like BioNames (and other things I've been working on such as IPNI names) aren't directly tackling this. Yes, it's nice to have a bibliographic citation/identifier for the original description of a name (or subsequent name changes), but what we'd really like is to be able to (a) see that description and (b) have it accessible to machines. So one thing I plan to add to BioNames is a way to automate going from a name to the page with the actual description, and display this information. For many names BioNames knows the page number of the description, and hence its location within the publication. So we simply need to pull out that page (allowing for edge cases where the mapping between digital and physical pages might not be straightforward) and display it (together with text). If we have XML we can also try and locate the descriptions within the text (for some experiments using XSLT see https://github.com/rdmpage/ipni-names/tree/master/pmc). There's lots of scope for simple text mining here, such as specimen codes (such as type specimens) and associated taxonomic names (synonyms, ecologically associated organisms, etc.).
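
The page lookup might be sketched like this in Python. The data structure is an assumption: each scanned page is a dict with a "label" holding the printed page number as a string, or None for plates and unnumbered pages. The function name is hypothetical and this is not how BioNames or BHL actually store pages.

```python
def locate_page(scan_pages, printed_page):
    """Return candidate scan indexes for a printed page number.

    scan_pages: list of dicts in scan order, each with a "label" key.
    printed_page: the printed page number as an int.
    """
    target = str(printed_page)
    exact = [i for i, page in enumerate(scan_pages)
             if page.get("label") == target]
    if exact:
        # More than one hit is possible, e.g. several articles bound together.
        return exact

    # Fallback: estimate from the closest preceding page that has a number.
    best = None
    for i, page in enumerate(scan_pages):
        label = page.get("label")
        if label and label.isdigit() and int(label) <= printed_page:
            guess = i + (printed_page - int(label))
            if guess < len(scan_pages):
                best = guess
    return [] if best is None else [best]


if __name__ == "__main__":
    # Invented example: a cover, two numbered pages, a plate, then page 3.
    pages = [{"label": None}, {"label": "1"}, {"label": "2"},
             {"label": None}, {"label": "3"}]
    print(locate_page(pages, 3))
```

Returning a list rather than a single index keeps the edge cases visible: an empty list means the page could not be placed, and several entries mean a person (or a further heuristic) has to choose.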

Dockerize all the things!

Lots of scope to explore using containers to provide services. Docker Hub provides Elastic Search and Neo4J, and Bitnami can run Elastic Search and CouchDB on Google's cloud. Hence we can play with various tools without having to install them. The other side is creating containers to package various tools (Global Names is doing this), or using containers to package up particular datasets and the tools needed to explore them. So much to learn in this area.

Wednesday, May 11, 2016

Scott Federhen

Awoke this morning to the sad news (via Scott Miller) that Scott Federhen of the NCBI had died. Anyone using the NCBI taxonomy is a beneficiary of Scott's work on bringing together taxonomy and genomic data.

Scott contributed both directly and indirectly to this blog. I reported on some of his work linking taxa in NCBI to sequences from type material (NCBI taxonomy database now shows type material), Scott commented on "dark taxa" and DNA barcoding (e.g., Dark taxa even darker: NCBI pulls (some) DNA barcodes from GenBank (updated)), and was an author on a guest post on "Putting GenBank Data on the Map" (in response to http://dx.doi.org/10.1126/science.341.6152.1341-a). He was also very helpful when I wanted to make links between the NCBI taxonomy and Wikipedia in my iPhylo Linkout project (see http://dx.doi.org/10.1371/currents.RRN1228).

I didn't know Scott well, but always enjoyed chatting to him at meetings (most recently the 6th International Barcode of Life Conference at Guelph). He wasn't shy about putting forth his views, or sharing his enthusiasm for ideas. Indeed, last time we met he was handing out paper copies of his preprint "Replication is Recursion; or, Lambda: the Biological Imperative" (available on bioRχiv http://dx.doi.org/10.1101/018804). He then followed up by sending me a t-shirt with the "replication is recursion" logo printed on one side and some penguins on the other (if I remember correctly this was designed by a member of his family). I delight in baffling students by wearing it sometimes when I lecture.

A number of people in the bioinformatics and biodiversity informatics communities are in shock this morning; this is obviously as nothing compared to what his family must be going through.

Tuesday, May 10, 2016

Notes on next steps for the million DNA barcodes map

Some notes to self about future directions for the "million DNA barcodes map" http://iphylo.org/~rpage/bold-map/.

At the moment we have an interactive map that we can pan and zoom, and click on a marker to get a list of one or more barcodes at the location. We can also filter by major taxonomic group. Here are some ideas on what could be next.

Search

At the moment search is simply browsing the map. It would be handy to be able to enter a taxon or a barcode identifier and go to the corresponding markers on the map.

What is this?

If we have a single DNA barcode I immediately want to know "what is this?" A picture may help, and I may look up the scientific name in BioNames, but perhaps the most obvious thing to do is get a phylogeny for that barcode and similar sequences. These could then be displayed on the map using the technique I described in Visualising Geophylogenies in Web Maps Using GeoJSON (see also http://dx.doi.org/10.1371/currents.tol.8f3c6526c49b136b98ec28e00b570a1e).

So, ideally we would:

Display information about that barcode (e.g., taxonomic identification where known).
Display the local phylogeny of barcodes that contains this barcode.
Display that phylogeny on the map.

Hence we need to be able to generate a local phylogeny of barcodes, either on the fly (retrieve similar sequences then build tree) or using a precomputed global barcode phylogeny from which we pull out the local subtree.

What is there?

A question that the map doesn't really answer is "what is the diversity of a given area?". Yes there are lots of dots, and you can click on them, but what would be nice is the ability to draw a polygon on the map (like this) and get a summary of the phylogenetic diversity of barcodes within that area.

For example, imagine drawing a polygon around Little Barrier Island in New Zealand. Can we effectively retrieve the data published by Drummond et al. (Evaluating a multigene environmental DNA approach for biodiversity assessment, DOI:10.1186/s13742-015-0086-1)?

To support "what is there?" queries we need to be able to:

Draw an arbitrary spatial region on the map and retrieve a set of sequences found within that region
Retrieve the phylogeny for that set of sequences
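
The first of these steps can be sketched in Python with a standard ray-casting point-in-polygon test. This is a simplification: it treats longitude and latitude as flat x and y coordinates, ignores polygons crossing the antimeridian, and scans every barcode rather than using a spatial index. The record keys ("id", "lon", "lat") and the coordinates are invented for illustration.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how many polygon edges a ray to the east crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def barcodes_in_region(barcodes, polygon):
    """Return the ids of barcodes whose coordinates fall inside the polygon."""
    return [b["id"] for b in barcodes
            if point_in_polygon(b["lon"], b["lat"], polygon)]


if __name__ == "__main__":
    # Invented square region and invented barcode records.
    region = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
    records = [
        {"id": "barcode-1", "lon": 5.0, "lat": 5.0},
        {"id": "barcode-2", "lon": 15.0, "lat": 5.0},
    ]
    print(barcodes_in_region(records, region))
```

With a million barcodes a database with spatial indexing would be the more sensible home for this query; the sketch only shows what the query has to compute.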

Once again, we either need to be able to build a phylogeny for an arbitrary set of sequences on the fly, or extract a subtree. If a global tree is available, we could compute the length of the subtree, and also compute a visual layout fairly easily (essentially with time proportional to the number of sequences).
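
As a sketch of the subtree-length idea, assume the global tree is stored as a child-to-parent map plus the length of the branch above each node (both structures, and the function name, are assumptions for illustration). A branch belongs to the smallest subtree connecting the chosen sequences exactly when some, but not all, of those sequences sit below it. This simple version walks from every leaf to the root, so it costs more than time proportional to the number of sequences on a deep tree; a production version would stop at the common ancestor or use precomputed indexes.

```python
def subtree_length(parent, branch_length, leaves):
    """Total branch length of the minimal subtree connecting the leaves.

    parent: dict mapping each node to its parent (the root maps to None
            or is absent).
    branch_length: dict mapping each non-root node to the length of the
            branch above it.
    leaves: iterable of leaf names to include.
    """
    leaves = set(leaves)
    k = len(leaves)
    below = {}  # node -> number of selected leaves at or below it
    for leaf in leaves:
        node = leaf
        while node is not None:
            below[node] = below.get(node, 0) + 1
            node = parent.get(node)
    # Nodes with all k leaves below them are the common ancestor or above it,
    # so their branches are not part of the connecting subtree.
    return sum(branch_length[node] for node, count in below.items()
               if count < k)


if __name__ == "__main__":
    # Invented tree: root -> (n1 -> (a, b), c)
    parent = {"a": "n1", "b": "n1", "n1": "root", "c": "root", "root": None}
    lengths = {"a": 0.5, "b": 0.5, "n1": 1.0, "c": 2.0}
    print(subtree_length(parent, lengths, ["a", "b"]))
    print(subtree_length(parent, lengths, ["a", "c"]))
```

The same counts identify which nodes to keep when pulling out the subtree for display.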

We'd also need to decide on the best way to visualise the phylogeny for the set of sequences. Perhaps something like Krona, or something more traditional.

Summary

There doesn't seem to be any way of getting away from the need for a global phylogeny of COI DNA barcodes if I want to extend the functionality of the map.

State of open knowledge about the World's Plants

Kew has released a new report today, entitled the State of the World's Plants, complete with its own web site https://stateoftheworldsplants.com. Its aim:

...by bringing the available information together into one document, we hope to raise the profile of plants among the global community and to highlight not only what we do know about threats, status and uses, but also what we don’t. This will help us to decide where more research effort and policy focus is required to preserve and enhance the essential role of plants in underpinning all aspects of human wellbeing.

This is, of course, a laudable goal, and a lot of work has gone into this report, and yet there are some things about the report that I find very frustrating.

PDF but no ePub. It's nice to have an interactive web site as well as a glossy PDF, but why restrict yourself to a PDF? Why not an ePub so people can view it and rescale fonts for their device, etc.? Why not provide the original text in a form people can translate? The report states that much of the newly discovered plant biodiversity is found in Brazil and China, so why not make it easier to support automatic translation into Portuguese and Chinese?
Why no DOI for the report? If this is such an important document, why doesn't it have a DOI so it can be easily cited?
Why no DOIs for cited literature? The report cites 219 references. Very few of them are accompanied by a DOI, yet most of the references have one. Why not include the DOI so readers can click on that and go straight to the literature? Surely you want to encourage readers to engage with the subject by reading more? The whole point of having digital documents online is that they can link to other documents.
No open access taxonomy. Sadly the examples of exciting new plant species discovered are all in closed access publications, including The Gilbertiodendron species complex (Leguminosae: Caesalpinioideae), Central Africa (DOI:10.1007/s12225-015-9579-4), published in Kew's own journal Kew Bulletin. This article costs $39.95 / €34.95 / £29.95 to read. Why do taxonomists continue to publish their research, often about taxa in the developing world, behind paywalls?
Why is the data not open? Much of the section on "Describing the world’s plants" uses data from Kew's database IPNI. This database is not open, so how does the reader verify the numbers in the report? Or, more importantly, how does the reader explore the data further and ask questions not asked in the report?

These may seem like small issues given the subject of the report (the perilous state of much of the planet's biodiversity), but if we are to take seriously the goal of "help[ing] us to decide where more research effort and policy focus is required to preserve and enhance the essential role of plants in underpinning all aspects of human wellbeing" then I suggest that open access to knowledge about plant diversity is a key part of that goal.

Over a decade ago Tom Moritz wrote of the need for a "biodiversity commons": DOI:10.1045/june2002-moritz

Provision of free, universal access to biodiversity information is a practical imperative for the international conservation community — this goal should be accomplished by promotion of the Public Domain and by development of a sustainable Biodiversity Information Commons adapting emergent legal and technical mechanisms to provide a free, secure and persistent environment for access to and use of biodiversity information and data. - "Building the Biodiversity Commons" DOI:10.1045/june2002-moritz

The report itself alludes to the importance of "opening up of global datasets with long-time series (such as maps of forest loss)", and yet botany has been slow to do this for much of its data (see Why are botanists locking away their data in JSTOR Plant Science?). We need data on plant taxonomy, systematics, traits, sequences, and distribution to be open and freely available to all, not closed behind paywalls or limited access APIs. Indeed, Donat Agosti has equated copyright to biopiracy (Biodiversity data are out of local taxonomists' reach, DOI:10.1038/439392a).

It would be nice to think that Kew, as well as leading the way in summarising the state of the world's plants, would also be leading the way in making that knowledge about those plants open to all.