Thursday, June 23, 2016

Aggregating annotations on the scientific literature: a hack for ReCon 16

I will be at ReCon 16 in Edinburgh (hashtag #ReCon_16), the second ReCon event I've attended (see Thoughts on ReCon 15: DOIs, GitHub, ORCID, altmetric, and transitive credit). For the hack day that follows I've put together some instructions for a way to glue together annotations made by multiple people using hypothes.is. It works by using IFTTT to read a user's annotation stream (i.e., the annotations they've made) and then post those to a CouchDB database hosted by Cloudant.

Why, you might ask? Well, I'm interested in using hypothes.is to make machine-readable annotations on papers. For example, we could select a pair of geographic co-ordinates (latitude and longitude) in a paper, tag it "geo", then have a tool that takes that annotation, converts it to a pair of decimal numbers and renders it on a map.


Or we could be reading a paper and the literature cited lacks links to the cited literature (i.e., there are no DOIs). We could add those by selecting the reference, pasting in the DOI as the annotation, and tagging it "cites". If we aggregate all those annotations then we could write a query that lists all the DOIs of the cited literature (i.e., it builds a small part of the citation graph).

By aggregating across multiple users we effectively crowdsource the annotation problem, but in a way that we can still collect those annotations. For this hack I'm going to automate this collection by enabling each user to create an IFTTT recipe that feeds their annotations into the database (they can switch this feature off at any time by switching off the recipe).
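In the hack itself IFTTT does the plumbing, but the same aggregation can be sketched directly against the hypothes.is search API. Here's a minimal version in Python; the Cloudant URL and the username are placeholders:

```python
import requests

HYPOTHESIS_API = "https://hypothes.is/api/search"
# Placeholder: a Cloudant-hosted CouchDB database that will receive the annotations
COUCHDB_URL = "https://example.cloudant.com/annotations"

def harvest(username):
    """Fetch a user's public annotations and store each one in CouchDB."""
    params = {"user": "acct:{0}@hypothes.is".format(username), "limit": 200}
    rows = requests.get(HYPOTHESIS_API, params=params).json().get("rows", [])
    for annotation in rows:
        # Reuse the hypothes.is annotation id as the CouchDB document id,
        # so harvesting the same annotation twice doesn't create duplicates.
        annotation["_id"] = annotation["id"]
        requests.post(COUCHDB_URL, json=annotation)

harvest("example_user")
```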

Manual annotation is not scalable, but it does enable us to explore different ways to annotate the literature, and what sort of things people may be interested in. For example, we could flag scientific names, great numbers, localities, specimens, concepts, people, etc. We could explore what degree of post-processing would be needed to make the annotations computable (e.g., converting 8°07′45.73″S, 63°42′09.64″W into decimal latitude and longitude).
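As a concrete example of that post-processing, here is a minimal sketch that converts a degrees-minutes-seconds string like the one quoted above into decimal degrees (the regular expression only handles this one common format):

```python
import re

DMS = re.compile(r"(\d+)°(\d+)′([\d.]+)″([NSEW])")

def dms_to_decimal(text):
    """Convert e.g. 8°07′45.73″S, 63°42′09.64″W into signed decimal degrees."""
    values = []
    for degrees, minutes, seconds, hemisphere in DMS.findall(text):
        value = int(degrees) + int(minutes) / 60.0 + float(seconds) / 3600.0
        if hemisphere in ("S", "W"):
            value = -value
        values.append(round(value, 6))
    return tuple(values)

print(dms_to_decimal("8°07′45.73″S, 63°42′09.64″W"))
# (-8.129369, -63.702678)
```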

If this project works I hope to learn something about what people want to extract from the literature, and to what extent having a database of annotations can provide useful information. This will also help inform my thinking about automated annotation, which I've explored in Hypothes.is revisited: annotating articles in BioStor.

Wednesday, June 22, 2016

What happens when open access wins?

The last few days I've been re-reading articles about Ted Nelson's work (including the ill-fated Project Xanadu), reading articles celebrating his work (brought together in the open access book "Intertwingled"), playing with Hypothes.is, and thinking about annotation and linking. One of the things which distinguishes Nelson's view of hypertext from the current web is that for Nelson links are first class citizens: they are persistent, they are bidirectional, and they can link not just documents but parts of documents. In the web we have links that are unidirectional: when I link to something, the page I link to has no idea that I've made that link. Knowing who links to you turns out to be both hard to work out, and very valuable. In the academic world, links between articles (citations) form the basis of commercial databases such as the Web of Science. And of course, the distribution of links between web pages forms the basis of Google's search engine. Just as attempts to build free and open citation databases have come to nothing, there is no free and open search engine to compete with Google.

The chapters in "Intertwingled" make clear that hypertext had a long and varied history before being subsumed by the web. One project which caught my eye was Microcosm, which led me to the paper "Dynamic link inclusion in online PDF journals" (doi:10.1007/BFb0053299, there's a free preprint here). This article tackles the problem of adding links to published papers. These links could be to other papers (citations), to data sets, to records in online databases (e.g., DNA sequences), to names of organisms, etc. The authors outline four different scenarios for adding these links to an article.

In the first scenario the reader obtains a paper from a publisher (either open access or from behind a paywall), then, using a "linkbase" that they have access to, they add links to the paper.

[Figure: scenario 1, the reader holds the linkbase and adds the links themselves]

This is very much what Hypothes.is offers: you use their tools to add annotations to a paper, and those annotations remain under your control.

In the second scenario, the publisher owns the linkbase and provides the reader with an annotated version of the paper.

[Figure: scenario 2, the publisher holds the linkbase and supplies an annotated version of the paper]

This is essentially what tools like ReadCube offer. The two remaining scenarios cover the case where the reader doesn't get the paper from the publisher but instead gets the links. In one of these scenarios (shown below) the reader sends the paper to the publisher and gets the linked paper back in return; in the other (not shown) the reader gets the links but uses their own tools to embed them in the paper.

[Figure: scenario 3, the reader sends the paper to the publisher and receives the linked version in return]

If you're still with me at this point you may be wondering how all of this relates to the title of this essay ("What happens when open access wins?"). Well, imagine that academic publishing eventually becomes overwhelmingly open access, so that publishers are making content available for free. Is this a sustainable business model? Might a publisher, seeing the writing on the wall, start to think about what they can charge for, if not articles? (I'm deliberately ignoring the "author pays" model of open access as I'm not convinced this has a long term future.)

In the diagrams above the "linkbase" is on the publisher's side in two of the three scenarios. If I was a publisher, I'd be looking to assemble proprietary databases and linking tools to create value that I could then charge for. I'm sure this is happening already. I suspect that the growing trend to open access for publications is not going to be enough to keep access to scientific knowledge itself open. In many ways publications themselves aren't terribly useful; it's the knowledge they contain that matters. Extracting, cross linking, and interpreting that knowledge is going to require sophisticated tools. The next challenge is going to be ensuring that the "linkbases" generated by those tools remain free and open, or an "open access" victory may turn out to be hollow.

Thursday, May 26, 2016

Thoughts on Wikipedia, Wikidata, and the Biodiversity Heritage Library

Given that Wikipedia, Wikidata, and the Biodiversity Heritage Library (BHL) all share the goal of making information free, open, and accessible, there seems to be a lot of potential for useful collaboration. Below I sketch out some ideas.

BHL as a source of references for Wikipedia

Wikipedia likes to have sources cited to support claims in its articles. BHL has a lot of articles that could be cited by Wikipedia articles. By adding these links, Wikipedia users get access to further details on the topic of interest. BHL also benefits from greater visibility resulting from visits from Wikipedia readers.

In the short term BHL could search Wikipedia for articles that could benefit from links to BHL (see below). In the long term, as more and more BHL articles get DOIs, this will become redundant because Wikipedia authors will discover articles via CrossRef.

There are various ways to search Wikipedia to get a sense of what links could be added. For example, you can search the Wikipedia API for pages that link to a particular web domain (see https://www.mediawiki.org/wiki/API:Lists/All#Exturlusage). Here's a search for articles linking to biostor.org https://en.wikipedia.org/w/api.php?action=query&list=exturlusage&euquery=biostor.org&eulimit=20.

A quick inspection suggests that many of these links could be improved (for example, some are outdated links to PDFs rather than to the article), so we can locate Wikipedia articles that could be edited. Wikipedia articles that have one link to BHL or BioStor are likely to have other citations that could be linked.
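For anyone who wants to script this, a sketch of the same query using the API directly (continuation of results is omitted for brevity):

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def pages_linking_to(domain, limit=20):
    """List Wikipedia pages that have external links to the given domain."""
    params = {
        "action": "query",
        "list": "exturlusage",
        "euquery": domain,
        "eulimit": limit,
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    return [(row["title"], row["url"]) for row in data["query"]["exturlusage"]]

for title, url in pages_linking_to("biostor.org"):
    print(title, url)
```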

Wikipedia as a source of content

One of the big challenges facing BHL is extracting articles from its content. My own BioStor is one approach to tackling this problem. BioStor takes citation details for articles and attempts to locate them in BHL - the limiting factor is access to good-quality citation data. Wikipedia is potentially an untapped source of citation data. Each page that uses the "Cite" template could be mined for citations, which in turn could be used to locate articles. Wikipedia pages using the Cite template can be found via the API, e.g. https://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eititle=Template:Cite&eilimit=20&format=json. Alternatively, we could mine particular types of pages (e.g., those on taxa or taxonomists), or mine Wikispecies (which doesn't use the same citation formatting as Wikipedia).
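A sketch of finding those pages via the embeddedin list (as in the URL above); actually extracting the citations would mean fetching each page's wikitext and parsing the Cite template parameters:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def pages_embedding(template, limit=20):
    """List pages that transclude the given template, e.g. 'Template:Cite'."""
    params = {
        "action": "query",
        "list": "embeddedin",
        "eititle": template,
        "eilimit": limit,
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    return [row["title"] for row in data["query"]["embeddedin"]]

print(pages_embedding("Template:Cite"))
```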

Wikidata as a data store

If Wikidata aims to be a repository of all structured data relevant to Wikipedia, then this includes bibliographic citations (see WikiCite 2016), hence many articles in BHL will end up in Wikidata. This has some interesting implications, because Wikidata can model data with more fidelity than many other sources of bibliographic information. For example, it supports multiple languages as well as multiple representations of the same language - the journal Acta Herpetologica Sinica https://www.wikidata.org/wiki/Q24159308 in Wikidata has not only the Chinese title (兩棲爬行動物學報) but also the pinyin transliteration "Liangqi baxing dongwu yanjiu". Rather than attempt to replicate a community-editable database, Wikidata could be the place to manage article- and journal-level metadata.
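As a small illustration, a sketch of pulling the multilingual labels and aliases for that journal item from the Wikidata API:

```python
import requests

API = "https://www.wikidata.org/w/api.php"

def labels_and_aliases(qid):
    """Return the labels and aliases (keyed by language) for a Wikidata item."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "props": "labels|aliases",
        "format": "json",
    }
    entity = requests.get(API, params=params).json()["entities"][qid]
    labels = {lang: v["value"] for lang, v in entity.get("labels", {}).items()}
    aliases = {lang: [a["value"] for a in values]
               for lang, values in entity.get("aliases", {}).items()}
    return labels, aliases

print(labels_and_aliases("Q24159308"))  # Acta Herpetologica Sinica
```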

Disambiguating people

As we move from "strings to things" we need to associate names for things with identifiers for those things. I've touched on this already in Possible project: mapping authors to Wikipedia entries using lists of published works. Ideally each author in BHL would be associated with a globally unique identifier, such as ORCID or ISNI. Contributors to Wikipedia and Wikidata have been collecting these for individuals with Wikipedia articles. If those Wikipedia pages have links to BHL content then we can semi-automate the process of linking people to identifiers.

Caveats

There are a couple of potential "gotchas" concerning Wikipedia and BHL. The licenses used for content are different: BHL content is typically CC BY-NC, whereas Wikipedia is CC BY-SA. The "non commercial" restriction used by BHL is a deal-breaker for sharing content such as page images with Wikimedia Commons.

Wikipedia and Wikidata are communities, and I've often found this makes it challenging to find out how to get things done. Who do you contact to make a decision about some new feature you'd like to add? It's not at all obvious (unless you're a part of that community). Existing communities with accepted practices can be resistant to change, or may not be convinced that what you'd like to do is a benefit. For example, I think it would be great to have a Wikipedia page for each journal. Not everyone agrees with this, and one can expend a lot of energy debating the pros and cons. The last time I got seriously engaged with Wikipedia I ended up getting so frustrated I went off in a huff and built my own wiki. This is where a "Wikipedian in residence" might be helpful.

Notes on current and future projects

I'll be taking a break shortly, so I thought I'd try to gather some thoughts on a few projects/ideas that I'm working on. These are essentially extended notes to myself to jog my memory when I return to these topics.

BOLD data into GBIF

Following on from work on getting mosquito data into GBIF I've been looking at DNA barcoding data. BOLD data is mostly absent from GBIF. The publicly available data can be downloaded, and is in a form that could be easily ingested by GBIF. One problem is that the data is incomplete, and sometimes out of date. BOLD's data dumps and BOLD's API use different formats (sigh), and the API returns additional data such as images of voucher specimens. Most data in the data dumps are not identified to species, so they will have limited utility for most GBIF users.

One approach would be to take the data dumps as the basic data, then use the API to enhance that data, such as adding image links. If the API returns a species-level identification for a barcode then that could be added as an identification using the corresponding Darwin Core extension. In this way we could treat the data as an evolving entity, which it is as our knowledge of it improves. For a related example see Leafcutter bee new to science with specimen data on Canadensys, where Canadensys records two different identifications of some bee specimens, as research showed that some specimens represented a new species.
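A minimal sketch of that approach, assuming we already have a record from a BOLD data dump and a species-level identification returned by the API (the input dictionaries and their keys are illustrative, not BOLD's actual column names; the output fields follow Darwin Core and the GBIF Identification extension):

```python
def to_darwin_core(dump_record, api_identification=None):
    """Build an occurrence row, plus an identification row if the API
    supplied a more recent species-level identification."""
    occurrence = {
        "occurrenceID": dump_record["processid"],
        "scientificName": dump_record["identification"],
        "decimalLatitude": dump_record.get("lat"),
        "decimalLongitude": dump_record.get("lon"),
        "basisOfRecord": "PreservedSpecimen",
    }
    identifications = []
    if api_identification:
        identifications.append({
            "coreid": occurrence["occurrenceID"],
            "scientificName": api_identification["species"],
            "dateIdentified": api_identification.get("date"),
            "identifiedBy": api_identification.get("identifier"),
        })
    return occurrence, identifications

# Illustrative input records
occurrence, identifications = to_darwin_core(
    {"processid": "ABC123-10", "identification": "Culicidae", "lat": -1.5, "lon": 36.8},
    {"species": "Aedes aegypti", "date": "2016-01-01", "identifier": "J. Smith"},
)
print(occurrence)
print(identifications)
```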

This work reflects my concern that GBIF is missing a lot of data outside its normal sources. The mechanism for getting data into GBIF is pretty bureaucratic and could do with reforming (or, at least provision of other ways to add data).

BOLD by itself

I've touched on this before (Notes on next steps for the million DNA barcodes map): I'd really like to do something better with the way we display and interact with DNA barcode data. This will need some thought on calculating and visualising massive phylogenies, and spatial queries that return subtrees. I can't help thinking that there's scope for some very cool things in this area. If nothing else, we can do interesting things without getting involved in some of the pain of taxonomic names.

Big trees

Viewing big trees is still something of an obsession. I still think this hasn't been solved in a way that helps us learn about the tree and the entities in that tree. I currently think that a core problem to solve is how to cluster or "fold" a tree in a sensible way to highlight the major groups. I did something rather crude here; other approaches include "Constructing Overview + Detail Dendrogram-Matrix Views" (doi:10.1109/TVCG.2009.130, PDF here).

Graph databases and the biodiversity knowledge graph

I'm working on a project to build a "biodiversity knowledge graph" (see doi:10.3897/rio.2.e8767). In some ways this is recreating my entry in Elsevier's Grand Challenge "Knowledge Enhancement in the Life Sciences" (see also hdl:10101/npre.2009.3173.1 and doi:10.1016/j.websem.2010.03.004).

Currently I'm playing with Neo4J to build the graph from JSON documents stored in CouchDB. Learning Neo4J is taking a little time, especially as I'm creating nodes and edges on the fly and want to avoid creating more than one node for the same thing. In a world of multiple identifiers this gets tricky, but I think there's a reasonable way to do this (see the graph gist Handling multiple identifiers). Since I'm harvesting data I'm ending up building a web crawler, so I need to think about queues, and ways to ensure that data added at different times gets properly linked.
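One way to avoid duplicate nodes (not necessarily the approach taken in the graph gist) is to MERGE a node for every identifier and link identifiers known to refer to the same thing, so a cluster of identifiers stands in for the thing itself. A sketch using the Neo4j Python driver, with placeholder connection details and illustrative identifiers:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_same_as(session, id1, id2):
    # MERGE guarantees each identifier node is created at most once,
    # no matter how many times the crawler encounters it.
    session.run(
        "MERGE (a:Identifier {value: $id1}) "
        "MERGE (b:Identifier {value: $id2}) "
        "MERGE (a)-[:SAME_AS]-(b)",
        id1=id1, id2=id2,
    )

with driver.session() as session:
    # A DOI and a BioStor id that (for the sake of example) refer to the same article
    add_same_as(session, "doi:10.9999/example.123", "biostor:12345")
```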

Wikipedia and Wikidata

I occasionally play with Wikipedia and Wikidata, although this is often an exercise in frustration as editing Wikipedia tends to result in edit wars ("we don't do things that way"). Communities tend to be conservative. I'll write up some notes about ways Wikipedia and Wikidata can be useful, especially in the context of the Biodiversity Heritage Library (see also Possible project: mapping authors to Wikipedia entries using lists of published works).

All the names

The database of all taxonomic names remains as elusive as ever -- our field should be deeply embarrassed by this, it's just ridiculous.

My own efforts in this area involve (a) obtaining lists of names, by whatever means available, and (b) augmenting them to include links to the primary literature. I've made some of this work publicly accessible (e.g., BioNames). I'd like all name databases to make their data open, but most are resistant to the idea (some aggressively so).

One approach to this is to simply ignore the whimpering and make the data available. Another is to consider recreating the data. We have name finding algorithms, and more of the literature is becoming available, either completely open (e.g., BHL) or accessible to mining (see Content Mine). At some point we will be able to recreate the taxonomic name databases from scratch, making the obstinate data providers no longer relevant.

First descriptions

Names, by themselves, are not terribly useful. But the information that hangs off them is. It occurs to me that projects like BioNames (and other things I've been working on such as IPNI names) aren't directly tackling this. Yes, it's nice to have a bibliographic citation/identifier for the original description of a name (or subsequent name changes), but what we'd really like is to be able to (a) see that description and (b) have it accessible to machines. So one thing I plan to add to BioNames is to automate going from a name to the page with the actual description, and to display this information. For many names BioNames knows the page number of the description, and hence its location within the publication. So we simply need to pull out that page (allowing for edge cases where the mapping between digital and physical pages might not be straightforward) and display it (together with text). If we have XML we can also try to locate the descriptions within the text (for some experiments using XSLT see https://github.com/rdmpage/ipni-names/tree/master/pmc). There's lots of scope for simple text mining here, such as finding specimen codes (such as type specimens) and associated taxonomic names (synonyms, ecologically associated organisms, etc.).
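As an illustration of what "simple text mining" might look like, a naive sketch that pulls candidate specimen codes (institution acronym plus catalogue number) out of a description; the pattern is deliberately crude and will both miss codes and produce false positives:

```python
import re

# Rough pattern: 2-6 capital letters, optional space, then a catalogue number
SPECIMEN_CODE = re.compile(r"\b([A-Z]{2,6})\s?(\d[\d.\-/]*\d|\d)\b")

text = ("Holotype: BMNH 1901.2.3.4, adult male. "
        "Paratypes: MCZ 12345, AMNH 98765-98767.")

for acronym, number in SPECIMEN_CODE.findall(text):
    print(acronym, number)
# BMNH 1901.2.3.4
# MCZ 12345
# AMNH 98765-98767
```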

Dockerize all the things!

Lots of scope to explore using containers to provide services. Docker Hub provides Elastic Search and Neo4J, and Bitnami can run Elastic Search and CouchDB on Google's cloud. Hence we can play with various tools without having to install them. The other side is creating containers to package various tools (Global Names is doing this), or using containers to package up particular datasets and the tools needed to explore them. So much to learn in this area.
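For example, here is a sketch using the Docker SDK for Python to spin up a Neo4j container to play with (the image tag, ports, and NEO4J_AUTH value are assumptions based on the official image's defaults):

```python
import docker

client = docker.from_env()

# Run Neo4j in the background, exposing the browser (7474) and Bolt (7687) ports
container = client.containers.run(
    "neo4j",
    detach=True,
    ports={"7474/tcp": 7474, "7687/tcp": 7687},
    environment={"NEO4J_AUTH": "neo4j/password"},
)
print(container.name, container.status)
```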

Wednesday, May 11, 2016

Scott Federhen

Awoke this morning to the sad news (via Scott Miller) that Scott Federhen of the NCBI had died. Anyone using the NCBI taxonomy is a beneficiary of Scott's work on bringing together taxonomy and genomic data.

Scott contributed both directly and indirectly to this blog. I reported on some of his work linking taxa in NCBI to sequences from type material (NCBI taxonomy database now shows type material), Scott commented on "dark taxa" and DNA barcoding (e.g., Dark taxa even darker: NCBI pulls (some) DNA barcodes from GenBank (updated)), and was an author on a guest post on "Putting GenBank Data on the Map" (in response to http://dx.doi.org/10.1126/science.341.6152.1341-a). He was also very helpful when I wanted to make links between the NCBI taxonomy and Wikipedia in my iPhylo Linkout project (see http://dx.doi.org/10.1371/currents.RRN1228).

I didn't know Scott well, but always enjoyed chatting to him at meetings (most recently the 6th International Barcode of Life Conference at Guelph). He wasn't shy about putting forth his views, or sharing his enthusiasm for ideas. Indeed, last time we met he was handing out paper copies of his preprint "Replication is Recursion; or, Lambda: the Biological Imperative" (available on bioRχiv http://dx.doi.org/10.1101/018804). He then followed up by sending me a t-shirt with the "replication is recursion" logo printed on one side and some penguins on the other (if I remember correctly this was designed by a member of his family). I delight in baffling students by wearing it sometimes when I lecture.

A number of people in the bioinformatics and biodiversity informatics communities are in shock this morning, this is obviously as nothing compared to what his family must be going through.

Tuesday, May 10, 2016

Notes on next steps for the million DNA barcodes map

Some notes to self about future directions for the "million DNA barcodes map" http://iphylo.org/~rpage/bold-map/.

[Screenshot: the million DNA barcodes map]

At the moment we have an interactive map that we can pan and zoom, and click on a marker to get a list of one or more barcodes at the location. We can also filter by major taxonomic group. Here are some ideas on what could be next.

Search

At the moment search is simply browsing the map. It would be handy to be able to enter a taxon or a barcode identifier and go to the corresponding markers on the map.

What is this?

If we have a single DNA barcode I immediately want to know "what is this?" A picture may help, and I may look up the scientific name in BioNames, but perhaps the most obvious thing to do is get a phylogeny for that barcode and similar sequences. These could then be displayed on the map using the technique I described in Visualising Geophylogenies in Web Maps Using GeoJSON (see also http://dx.doi.org/10.1371/currents.tol.8f3c6526c49b136b98ec28e00b570a1e).

So, ideally we would:

  1. Display information about that barcode (e.g., taxonomic identification where known).
  2. Display the local phylogeny of barcodes that contains this barcode.
  3. Display that phylogeny on the map.

Hence we need to be able to generate a local phylogeny of barcodes, either on the fly (retrieve similar sequences then build a tree) or using a precomputed global barcode phylogeny from which we pull out the local subtree.
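A sketch of the second option, pulling a small subtree around a given barcode out of a precomputed global tree (here using the ete3 library; the tree file, the sequence id, and the choice of "two ancestors up" as the neighbourhood are all placeholders):

```python
from ete3 import Tree

def local_subtree(newick_file, sequence_id, levels_up=2):
    """Return the clade containing sequence_id, expanded a few ancestors up."""
    tree = Tree(newick_file)
    matches = tree.search_nodes(name=sequence_id)
    if not matches:
        return None
    node = matches[0]
    for _ in range(levels_up):
        if node.up:
            node = node.up
    return node  # an ete3 TreeNode rooted at the local subtree

subtree = local_subtree("barcodes.nwk", "ABC123-10")
if subtree:
    print(subtree.get_ascii())
    print([leaf.name for leaf in subtree.get_leaves()])
```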

What is there?

A question that the map doesn't really answer is "what is the diversity of a given area?". Yes there are lots of dots, and you can click on them, but what would be nice is the ability to draw a polygon on the map (like this) and get a summary of the phylogenetic diversity of barcodes within that area.

For example, imagine drawing a polygon around Little Barrier Island in New Zealand. Can we effectively retrieve the data published by Drummond et al. (Evaluating a multigene environmental DNA approach for biodiversity assessment, DOI:10.1186/s13742-015-0086-1)?

To support "what is there?" queries we need to be able to:

  1. Draw an arbitrary spatial region on the map and retrieve a set of sequences found within that region
  2. Retrieve the phylogeny for that set of sequences

Once again, we either need to be able to build a phylogeny for an arbitrary set of sequences on the fly, or extract a subtree. If a global tree is available, we could compute the length of the subtree, and also compute a visual layout fairly easily (essentially in time proportional to the number of sequences).
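The spatial part of step 1 is easy to prototype, for example with the shapely library (the polygon and points below are made-up coordinates, very roughly around Little Barrier Island):

```python
from shapely.geometry import Point, Polygon

# Rough bounding polygon, as (longitude, latitude) pairs
area = Polygon([(175.05, -36.23), (175.13, -36.23),
                (175.13, -36.17), (175.05, -36.17)])

# (sequence id, longitude, latitude) for some barcodes
barcodes = [
    ("SEQ001", 175.08, -36.20),
    ("SEQ002", 174.90, -36.50),
]

inside = [seq_id for seq_id, lon, lat in barcodes
          if area.contains(Point(lon, lat))]
print(inside)  # ['SEQ001']
```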

We'd also need to decide on the best way to visualise the phylogeny for the set of sequences. Perhaps something like Krona, or something more traditional.


Summary

There doesn't seem to be any way of getting away from the need for a global phylogeny of COI DNA barcodes if I want to extend the functionality of the map.

State of open knowledge about the World's Plants

Kew has released a new report today, entitled the State of the World's Plants, complete with its own web site https://stateoftheworldsplants.com. Its aim:

...by bringing the available information together into one document, we hope to raise the profile of plants among the global community and to highlight not only what we do know about threats, status and uses, but also what we don’t. This will help us to decide where more research effort and policy focus is required to preserve and enhance the essential role of plants in underpinning all aspects of human wellbeing.

This is, of course, a laudable goal, and a lot of work has gone into this report, and yet there are some things about the report that I find very frustrating.

  1. PDF but no ePub. It's nice to have an interactive web site as well as a glossy PDF, but why restrict yourself to a PDF? Why not an ePub so people can view it and rescale fonts for their device, etc.? Why not provide the original text in a form people can translate? The report states that much of the newly discovered plant biodiversity is found in Brazil and China, so why not make it easier to support automatic translation into Portuguese and Chinese?
  2. Why no DOI for the report? If this is such an important document, why doesn't it have a DOI so it can be easily cited?
  3. Why no DOIs for cited literature? The report cites 219 references, very few of them accompanied by a DOI, yet most of the references have one. Why not include the DOI so readers can click on it and go straight to the literature? Surely you want to encourage readers to engage with the subject by reading more? The whole point of having digital documents online is that they can link to other documents.
  4. No open access taxonomy. Sadly, the examples of exciting new plant species discovered are all in closed access publications, including The Gilbertiodendron species complex (Leguminosae: Caesalpinioideae), Central Africa DOI:10.1007/s12225-015-9579-4, published in Kew's own journal Kew Bulletin. This article costs $39.95 / €34.95 / £29.95 to read. Why do taxonomists continue to publish their research, often about taxa in the developing world, behind paywalls?
  5. Why is the data not open? Much of the section on "Describing the world’s plants" uses data from Kew's database IPNI. This database is not open, so how does the reader verify the numbers in the report? Or, more importantly, how does the reader explore the data further and ask questions not asked in the report?

These may seem like small issues given the subject of the report (the perilous state of much of the planet's biodiversity), but if we are to take seriously the goal of "help[ing] us to decide where more research effort and policy focus is required to preserve and enhance the essential role of plants in underpinning all aspects of human wellbeing" then I suggest that open access to knowledge about plant diversity is a key part of that goal.

Over a decade ago Tom Moritz wrote of the need for a "biodiversity commons": DOI:10.1045/june2002-moritz

Provision of free, universal access to biodiversity information is a practical imperative for the international conservation community — this goal should be accomplished by promotion of the Public Domain and by development of a sustainable Biodiversity Information Commons adapting emergent legal and technical mechanisms to provide a free, secure and persistent environment for access to and use of biodiversity information and data. - "Building the Biodiversity Commons" DOI:10.1045/june2002-moritz

The report itself alludes to the importance of "opening up of global datasets with long-time series (such as maps of forest loss)", and yet botany has been slow to do this for much of its data (see Why are botanists locking away their data in JSTOR Plant Science?). We need data on plant taxonomy, systematics, traits, sequences, and distribution to be open and freely available to all, not closed behind paywalls or limited access APIs. Indeed, Donat Agosti has equated copyright to biopiracy (Biodiversity data are out of local taxonomists' reach, DOI:10.1038/439392a).

It would be nice to think that Kew, as well as leading the way in summarising the state of the world's plants, would also be leading the way in making that knowledge about those plants open to all.

Wednesday, April 27, 2016

Possible project: Biodiversity dashboard

Despite the well deserved scepticism about dashboards voiced by Shannon Mattern @shannonmattern (see Mission Control: A History of the Urban Dashboard; I discovered this by reading Ignore the Bat Caves and Marketplaces: lets talk about Zoning by Leigh Dodds @ldodds) I'm intrigued by the idea of a dashboard for biodiversity. We could have several different kinds of information, displayed in a single place.

Immediate information

There are sites such as Global Forest Watch Fires that track events that affect biodiversity and which are happening right now. Some of this data can be harvested (e.g., from the NASA Fire Information for Resource Management System) to show real-time forest fires. Below is an image for the last 24 hours:

We could also have Twitter feeds of these sorts of events.
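As a sketch of the harvesting step, assuming we have downloaded one of the active-fire CSV files that FIRMS makes available (the file name is a placeholder and the column names are assumptions; check the FIRMS site for the actual downloads and schema):

```python
import pandas as pd

# Placeholder file name for a CSV of fire detections from the last 24 hours
fires = pd.read_csv("active_fires_24h.csv")

# Assuming 'latitude' and 'longitude' columns, bin detections onto a
# one-degree grid to get a coarse picture of where fires are concentrated
fires["cell"] = (fires["latitude"].round().astype(int).astype(str) + "," +
                 fires["longitude"].round().astype(int).astype(str))
print(fires["cell"].value_counts().head(10))
```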

Historical trends

We could have longer-term trends, such as changes in forest cover, or changes in abundance of species over time.

Trends in information

We could have feeds that show us how our knowledge is changing. For example, we could have a map of data from the newest datasets uploaded to GBIF, the latest DNA barcodes, etc.

As an example, @wikiredlist tweets every time an article about a species from the IUCN Red List is edited on the English language Wikipedia.

Imagine several such streams, both as lists and as maps. As another example, a while ago I created a visualisation of new species discoveries:

Summary

I'm aware of the irony of drawing inspiration from a critique of dashboards, but I still think there is value in having an overview of global biodiversity. But we shouldn't lose sight of the fact that such views will be biased and constrained, and in many cases it will be much easier to visualise what is going on (or, at least, what our chosen sources reveal) than to effect change on those trends that we find most alarming.

Thursday, April 21, 2016

Searching GBIF by drawing on a map

One of my frustrations with the GBIF portal is that it is hard to drill down and search in a specific area. You have to zoom in and then click for a list of occurrences in the current bounding box of the map. You can't, for example, draw a polygon such as the boundary of a protected area and search within that area.

As a quick and dirty hack I've thrown together a demo of how it would be possible to search GBIF by drawing the search area on a map. Once a shape is drawn, we call GBIF's API to retrieve the first 300 occurrences from that area. The code is here, and below is a live demo (see also http://bl.ocks.org/rdmpage/43073981694598fecab725a16e890d3b).
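Under the hood the API call is simple: the drawn shape goes into the geometry parameter of GBIF's occurrence search as WKT (300 is the maximum page size). A sketch in Python:

```python
import requests

GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"

def occurrences_in(wkt, limit=300):
    """Fetch up to `limit` occurrences whose coordinates fall inside the WKT shape."""
    params = {"geometry": wkt, "limit": limit}
    results = requests.get(GBIF_SEARCH, params=params).json()["results"]
    return [(r.get("scientificName"),
             r.get("decimalLatitude"),
             r.get("decimalLongitude")) for r in results]

# A small made-up polygon (WKT uses longitude latitude order)
wkt = "POLYGON((-4.30 55.85, -4.25 55.85, -4.25 55.88, -4.30 55.88, -4.30 55.85))"
for name, lat, lon in occurrences_in(wkt):
    print(name, lat, lon)
```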

This demo uses Leaflet.draw to draw shapes, and Wicket to convert the GeoJSON shape to the WKT format required by GBIF's API. I was inspired by the Leaflet.draw plugin with options set demo by d3noob, and used it as a starting point.