iPhylo: pro-iBiosphere

Roderic D. M. Page

Showing posts with label pro-iBiosphere. Show all posts

Wednesday, June 11, 2014

The vision thing - it's all about the links

@rdmpage @AlexHardisty @proibiosphere Well, take part in the process of clarification!
— Pensoft Publishers (@Pensoft) June 10, 2014

I've been involved in a few Twitter exchanges about the upcoming pro-iBiosphere meeting regarding the "Open Biodiversity Knowledge Management System (OBKMS)", which is the topic of the meeting. Because for the life of me I can't find an explanation of what "Open Biodiversity Knowledge Management System" is, other than vague generalities and appeals to the magic pixie dust that is "Linked Open Data" and "RDF", I've been grumbling away on Twitter.

So, here's my take on what needs to be done. Fundamentally, if we are going to link biodiversity information together we need to build a network. What we have (for the most part) at the moment is a bunch of nodes (which you can think of as data providers such as natural history collections, databases, etc., or different kinds of data, such as names, publications, sequences, specimens, etc.).

We'd like a network, so that we can link information together, perhaps to discover new knowledge, to serve as a pathway for analyses that combine different sorts of data, and so on:

A network has nodes and links. Without the links there's no network. The fundamental problem as I see it is that we have nodes that have clear stakeholders (e.g., individual, museums, herbaria, publishers, database owners, etc.). They often build links, but they are typically incomplete (they don't link to everything that is relevant), and transitory (there's no mechanism to facilitate persistence of the links). There is no stakeholder for whom the links are most important. So, we have this:

This sucks. I think we need an entity, a project, and organisation, whatever you want to call it for whom the network is everything. In other words, they see the world like this:

If this is how you view the world, then your aim is to build that network. You live or die based on the performance of that network. You make sure the links exist, they are discoverable, and that they persist. You don't have the same interests as the nodes, but clearly you need to provide value to them because they are the endpoints of your links. But you also have users who don't need the nodes per see, they need the network.

If you buy this, then you need to think about how to grow the network. Are there network effects that you can leverage, in the same way CrossRef has with publishers submitting lists of literature cited linked to DOIs, or in social media where you give access to your list of contacts to build your social graph?

If the network is the goal, you don't just think "let's just stick HTTP URLs on everything and it will all be good". You can think like that if you are a node, because if the links die you can still persist (you'll still have people visiting your own web site). But if you are a network and the links die, you are in big trouble. So you develop ways to make the network robust. This is one reason why CrossRef uses an identifier based on indirection, it makes it easier to ensure the network persists in the face of change in how the nodes serve their data. What is often missed is that this also frees up the nodes, because they don't need to commit to serving a given URL in perpetuity, indirections shields them from this.

In order to serve users of the network, you want to ensure you can satisfy their needs rapidly. This leads to things like caching links and basic data about the end points of those links (think how Google caches the contents of web pages so if the site is offline you may still find what you are looking for).

If your business depends on the network, then you need to think how you can create incentives for nodes to join. For example, what services can you offer them that make you invaluable to the nodes? Once you crack that, then all sorts of things can happen. Take structured markup as an example. Google is driving this on the web using schema.org. If you want to be properly indexed by Google, and have Google display your content in a rich form (e.g., thumbnails, review ratings, location, etc.) you need to mark up your page in a way Google understands. Given that some businesses live or die based on their Google ranking, there's a strong incentive for web sites to adopt this markup. There's a strong incentive for Google to encourage markup so that it can provide informative results for its users (otherwise they might rely on "social search" via Facebook and mobile apps). This is the kind of thing you want the network to aim for.

In summary, this is my take on where we are at in biodiversity informatics. The challenge is that the organisations in the room discussing this are typically all nodes, and I'd argue that by definition they aren't in a position to solve the problem. You need to pivot (ghastly word) and think about it from the perspective of the network. Imagine you were to form a company whose mission was to build that network. How would you do it, how would you convince the nodes to engage, what value would you offer them, what value would you offer users of the network? If we start thinking along those lines, then I think we can make progress.

Monday, February 10, 2014

Mark-up of biodiversity literature

I gave a remote presentation at a proiBioSphere workshop this morning. The slides are below (to try and make it a bit more engaging than a desk of Powerpoints I played around with Prezi).

There is a version on Vimeo that has audio as well.

I sketched out the biodiversity "knowledge graph", then talked about how mark-up relates to this, finishing with a few questions. The question that seems to have gotten people a little agitated is the relative importance of markup versus, say, indexing. As Terry Catapano pointed out, in a sense this is really a continuum. If we index content (e.g., locate a string that is a taxonomic name) and flag that content in the text, then we are adding mark-up (if we don't, we are simply indexing, but even then we have mark-up at some level, e.g. "this term occurs some where on this page"). So my question is really what level of markup do we need to do useful work? Much of the discussion so far has centered around very detailed mark-up (e.g., the kind of thing ZooKeys does to each article). My concern has always been how scalable this is, given the size of the taxonomic literature (in which ZooKeys is barely a blip). It's the usual trade off, do we go for breadth (all content indexed, but little or no mark-up), or do we go for depth (extensive mark-up for a subset of articles)? Where you stand on that trade off will determine to what extent you want detailed mark up, versus whether indexing is "good enough".

Thursday, February 14, 2013

Does the legacy biodiversity literature matter?

I've just come back from a pro-iBiosphere Workshop at Leiden where the role of "legacy literature" became the subject of some discussion. This continued on Twitter as Ross Mounce (@rmounce) and I went back and forth:

@rdmpage but ~700,000 papers were published in 2009. Were there even 70,000 published in 1920? 2000-2012 contains *a lot*
— Ross Mounce (@rmounce) February 13, 2013

Ross was wondering whether we should invest much effort in extracting information from legacy literature, suggesting that this literature was of most interest to taxonomists, whereas other biologists will be more likely to find what they want from ever growing recent literature. I was arguing that because many taxa are poorly studied, the chances that you will find data on your organism in the recent literature is likely to be low, unless you study an economically or medically important taxon, or a model organism (many of which fit first categories). My view is based on papers such as Bob May's 1988 paper:

MAY, R. M. (1988). How Many Species Are There on Earth? Science, 241(4872), 1441-1449. doi:10.1126/science.241.4872.1441

In table 3 May lists the average number of papers per species in the period 1978-1987 across various taxonomic groups. Mammals averaged 1.8 papers per species, beetles averaged 0.01. This means that if you study a beetle species you have a 1/100 chance (on average) of finding a paper on your species in any given year (assuming all beetles are equal, which is clearly false). At this point perhaps we should define "legacy literature". In many ways the issue is not so much the age of the literature, but whether the literature was "born digital", that is, whether from it's authoring to publication the document has been in digital form, so the output is in a format (e.g., HTML, XML, or PDF that contains the document text) from which we can readily extract and mine the text. In contrast, documents that have been digitised from a physical medium (e.g., scans of pages) are less tractable because the text has to be extracted by OCR, and error-prone process. Given these errors is the effort worth it. At this point I should say that BHL is not using the best OCR technology available (my own experience suggests that ABBYY Online is much better), and our community is not making use of research on automating OCR correction). But the question is worth asking. In an effort to answer it, I've done a quick analysis of the PanTHERIA database:

Jones, K. E., Bielby, J., Cardillo, M., Fritz, S. A., O Dell, J., Orme, C. D. L., & Purvis, A. (2009). PanTHERIA: a species-level database of life history, ecology, and geography of extant and recently extinct mammals. (W. K. Michener, Ed.)Ecology, 90(9), 2648-2648. doi:10.1890/08-1494.1

PanTHERIA is a database assembled by Kate Jones (@ProfKateJones) and colleagues for comparative biologists (not taxonomists), and collects fundamental biological data about the best studied animal group on the planet (see May's paper above). In the metadata for the database there is a list of the 3143 publications they consulted to populate the database. Below is a table showing the distribution of the year in which these publications appeared:

Decade starting	Publications
1840	1
1860	1
1890	1
1900	10
1910	4
1920	14
1930	48
1940	61
1950	114
1960	295
1970	527
1980	865
1990	1019
2000	183

The bulk of the papers came from the second half of the 20th century, and many of these are "legacy" in the sense that they are in archives like JSTOR, and hence the PDFs are based on scanned images and OCR. The oldest papers are from the 19th century, which is legacy by anyone's definition. My interpretation of this data is that even for a well-studied group such as mammals, the basic organismal-level data sought by comparative biologists is in the "legacy" literature. My suspicion is that if we attempt to build PanTHERIA-style databases for other, less well-studied taxa, the data (if it exists at all) will be found not in the modern literature (where the focus has long since moved on from the organism to genomics and system biology) but in the corpus of taxonomic and ecological literature that are being scanned and stored in digital archives.

Update
I've put the articles cited as data sources by the PanTHERIA database in a Mendeley group.