iPhylo: vision

Roderic D. M. Page

Wednesday, June 11, 2014

The vision thing - it's all about the links

@rdmpage @AlexHardisty @proibiosphere Well, take part in the process of clarification!
— Pensoft Publishers (@Pensoft) June 10, 2014

I've been involved in a few Twitter exchanges about the upcoming pro-iBiosphere meeting regarding the "Open Biodiversity Knowledge Management System (OBKMS)", which is the topic of the meeting. Because for the life of me I can't find an explanation of what "Open Biodiversity Knowledge Management System" is, other than vague generalities and appeals to the magic pixie dust that is "Linked Open Data" and "RDF", I've been grumbling away on Twitter.

So, here's my take on what needs to be done. Fundamentally, if we are going to link biodiversity information together we need to build a network. What we have (for the most part) at the moment is a bunch of nodes (which you can think of as data providers such as natural history collections, databases, etc., or different kinds of data, such as names, publications, sequences, specimens, etc.).

We'd like a network, so that we can link information together, perhaps to discover new knowledge, to serve as a pathway for analyses that combine different sorts of data, and so on:

A network has nodes and links. Without the links there's no network. The fundamental problem as I see it is that we have nodes that have clear stakeholders (e.g., individuals, museums, herbaria, publishers, database owners, etc.). They often build links, but these are typically incomplete (they don't link to everything that is relevant), and transitory (there's no mechanism to facilitate persistence of the links). There is no stakeholder for whom the links are most important. So, we have this:

This sucks. I think we need an entity, a project, an organisation, whatever you want to call it, for whom the network is everything. In other words, they see the world like this:

If this is how you view the world, then your aim is to build that network. You live or die based on the performance of that network. You make sure the links exist, they are discoverable, and that they persist. You don't have the same interests as the nodes, but clearly you need to provide value to them because they are the endpoints of your links. But you also have users who don't need the nodes per se, they need the network.

If you buy this, then you need to think about how to grow the network. Are there network effects that you can leverage, in the same way CrossRef has with publishers submitting lists of literature cited linked to DOIs, or in social media where you give access to your list of contacts to build your social graph?

If the network is the goal, you don't just think "let's just stick HTTP URLs on everything and it will all be good". You can think like that if you are a node, because if the links die you can still persist (you'll still have people visiting your own web site). But if you are a network and the links die, you are in big trouble. So you develop ways to make the network robust. This is one reason why CrossRef uses an identifier based on indirection: it makes it easier to ensure the network persists in the face of change in how the nodes serve their data. What is often missed is that this also frees up the nodes, because they don't need to commit to serving a given URL in perpetuity; indirection shields them from this.
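To make the indirection point concrete, here is a minimal sketch in Python. It is a toy, not how CrossRef or the DOI system is actually implemented: the Resolver class, the identifier "10.9999/example.1" and both example.org URLs are made up for illustration. The only idea it shows is that the network stores the stable identifier, while the node remains free to change where the data lives.

```python
class Resolver:
    """Toy resolver: maps a stable identifier to its current location.

    Hypothetical illustration only; real resolvers are distributed,
    authenticated services, not an in-memory dictionary.
    """

    def __init__(self):
        self._locations = {}

    def register(self, identifier, url):
        # Registering again simply overwrites the previous location.
        self._locations[identifier] = url

    def resolve(self, identifier):
        # Raises KeyError if the identifier was never registered.
        return self._locations[identifier]


resolver = Resolver()

# A node registers where its record currently lives (made-up values).
resolver.register("10.9999/example.1", "http://museum.example.org/specimen?id=1")

# The network stores only the identifier, never the URL.
link = "10.9999/example.1"

# Later the node reorganises its web site and updates the resolver.
resolver.register("10.9999/example.1", "https://collections.example.org/specimens/1")

# The stored link still works, and now points at the new location.
print(resolver.resolve(link))
```

The node changed its URL and nobody holding the link had to do anything, which is the sense in which indirection protects both the network and the nodes.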

In order to serve users of the network, you want to ensure you can satisfy their needs rapidly. This leads to things like caching links and basic data about the end points of those links (think how Google caches the contents of web pages so if the site is offline you may still find what you are looking for).

If your business depends on the network, then you need to think how you can create incentives for nodes to join. For example, what services can you offer them that make you invaluable to the nodes? Once you crack that, then all sorts of things can happen. Take structured markup as an example. Google is driving this on the web using schema.org. If you want to be properly indexed by Google, and have Google display your content in a rich form (e.g., thumbnails, review ratings, location, etc.) you need to mark up your page in a way Google understands. Given that some businesses live or die based on their Google ranking, there's a strong incentive for web sites to adopt this markup. There's a strong incentive for Google to encourage markup so that it can provide informative results for its users (otherwise they might rely on "social search" via Facebook and mobile apps). This is the kind of thing you want the network to aim for.

In summary, this is my take on where we are at in biodiversity informatics. The challenge is that the organisations in the room discussing this are typically all nodes, and I'd argue that by definition they aren't in a position to solve the problem. You need to pivot (ghastly word) and think about it from the perspective of the network. Imagine you were to form a company whose mission was to build that network. How would you do it, how would you convince the nodes to engage, what value would you offer them, what value would you offer users of the network? If we start thinking along those lines, then I think we can make progress.

Thursday, April 18, 2013

Thoughts on GBIC 2012 and a vision of the future of biodiversity informatics

This seems to be the season for big, arm-wavy documents about the future of biodiversity informatics (see A decadal view of biodiversity informatics: challenges and priorities). An equivalent document is being drafted based on the Global Biodiversity Informatics Conference (GBIC 2012). Writing these documents is hard work: they have to balance a set of conflicting visions, predict the future, and communicate a coherent plan to people who either could help make this happen, or feel they have a stake in the outcome.

Leaving all those constraints behind, and waving arms wildly, here's one take on the future of biodiversity informatics. I see three themes.

1. Knowing what we know

We have a limited grasp of how much we actually know, and crap tools to summarise this knowledge. I want a Google Analytics for biodiversity data where I can see at a glance the current state of our knowledge (e.g., what is the rate of sequencing of environmental samples in the Mediterranean? How much of Indonesia's amphibian fauna is in protected areas?). These are fairly trivial queries. If Google can analyse web traffic from sites being hit over a million times per day (~365 million hits per year), we can do the same thing on GBIF-scale databases. There is huge scope here for cool visualisation of the growth of our knowledge, such as this:

If biologists were explorers (Mammalia)... from Andrew W Hill on Vimeo.

Imagine the GBIF classification like this:

filesystem visualisation from wonderful websolutions on Vimeo.

2. Life stream

Terrible title, but this is where we monitor change, both "organic" and anthropogenic. This is where we use data mining to do a sentiment analysis of the biosphere, looking to detect changes such as outbreaks of disease, invasive species, etc. This builds on 1 but focusses on change. Imagine a "news service" for biology along the lines of tools available to financial markets (e.g., Silobreaker):

This is where we interface with decision makers: if Braulio Dias's statement "I am convinced that the lack of adequate biodiversity monitoring is at the heart of our difficulties to make convincing arguments" is true, then this tackles that problem.

3. Modelling the biosphere

Time to model all life on Earth (http://dx.doi.org/10.1038/493295a) is our equivalent of a moon shot (oh how I hate that analogy). Purves et al. have made the case: this is the task that will galvanise people outside the taxonomy/biodiversity community. This is real megascience (1. is data collection, 2. is data mining and analysis). Climate modellers and oceanographers get to do this:

Can we do the same?

Thursday, June 25, 2009

EOL, Wikipedia, TDWG, LinkedData, and the Vision Thing

Time for more half-baked ideas. There's been a lot of discussion on Twitter about EOL, Linked Data (sometimes abbreviated LOD), and Wikipedia. Pete DeVries (@pjd) is keen on LOD, and has been asking why TDWG isn't playing in this space. I've been muttering dark thoughts about EOL, and singing the praises of Wikipedia. And so it goes on. So, here's one vision of where we could (?should) be going with this.

Let's imagine that we do indeed want to play in the Linked Data space. The concern that tends to be raised the most is that biodiversity informatics uses LSIDs as the standard GUID, and this doesn't play nice with Linked Data. This is true, but not life threatening. There are various hacks (like this and this) that deal with this.
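As a rough illustration of what such a hack can look like (not necessarily what the linked posts do), one common trick is to wrap the LSID in an HTTP URL served by a proxy, so that Linked Data clients have something they can dereference. The sketch below assumes the usual LSID layout of urn:lsid:authority:namespace:object with an optional revision. The proxy address http://lsid.example.org/ and the sample LSID are made up for illustration.

```python
def parse_lsid(lsid):
    """Split an LSID into its parts, assuming the layout
    urn:lsid:authority:namespace:object[:revision]."""
    parts = lsid.split(":")
    if len(parts) not in (5, 6):
        raise ValueError("not an LSID: " + lsid)
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not an LSID: " + lsid)
    return {
        "authority": parts[2],
        "namespace": parts[3],
        "object": parts[4],
        "revision": parts[5] if len(parts) == 6 else None,
    }


def lsid_to_http(lsid, proxy="http://lsid.example.org/"):
    """Wrap an LSID in an HTTP URL. The proxy is a hypothetical
    service that would resolve the LSID and return RDF."""
    parse_lsid(lsid)  # validate before wrapping
    return proxy + lsid


# Made-up LSID for illustration.
example = "urn:lsid:example.org:names:12345"
print(parse_lsid(example)["authority"])
print(lsid_to_http(example))
```

The LSID itself is untouched, so existing databases don't have to change anything. The cost is that somebody has to keep the proxy running.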

But, the real concern (I think) is that we need a way to link our stuff to the rest of the Linked Data cloud. That is, wherever possible we need to reuse existing identifiers. In the LOD diagram below (for the latest version see here) DBpedia.org is key to linking much of this together, and major players (such as the BBC) are now using DBpedia.org to make connections.

DBpedia.org is based on Wikipedia, so I think you can see where this is going. There are some 120,000+ taxon pages in Wikipedia, so that's some 120,000+ identifiers in DBpedia.org that others interested in organisms can (and will) use to refer to taxa. Given the centrality of Wikipedia and DBpedia to LOD, why don't we adopt DBpedia.org URIs as the default GUID for our taxa? At present we have numerous, competing identifiers (e.g., NCBI tax ids, ITIS tsn's, Catalogue of Life LSIDs, uBio NameBankID's, plus LSIDs from various nomenclators). For users this is a mess: which one do I use? Deciding requires dealing with issues (such as the difference between nomenclatural codes, and between taxonomic names and concepts, etc.) that, frankly, nobody outside our community cares about.
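To give a feel for how little machinery this needs, here is a small Python sketch. DBpedia resource URIs follow the pattern http://dbpedia.org/resource/ plus the Wikipedia page title with spaces replaced by underscores. The function name dbpedia_uri and the crosswalk dictionary are my own illustration, and the identifier values in the crosswalk are placeholders, not real ids.

```python
from urllib.parse import quote


def dbpedia_uri(wikipedia_title):
    """Build the DBpedia resource URI for a Wikipedia page title."""
    # Wikipedia titles use underscores in place of spaces.
    title = wikipedia_title.strip().replace(" ", "_")
    # Percent-encode anything unusual; keep brackets and commas readable.
    return "http://dbpedia.org/resource/" + quote(title, safe="_(),")


# A crosswalk keyed on the DBpedia URI. The values are placeholders
# standing in for whatever each database calls this taxon.
crosswalk = {
    dbpedia_uri("Panthera leo"): {
        "ncbi_taxid": "PLACEHOLDER",
        "itis_tsn": "PLACEHOLDER",
    }
}

print(dbpedia_uri("Panthera leo"))
```

The point of keying on the DBpedia URI is that someone outside our community only ever needs to know one identifier, and the mess of local ids hangs off it.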

So, if we want to play with LOD, we need to make our identifiers play nice (straightforward), and we should think seriously about adopting DBpedia.org URIs as the default GUID for taxa.

Now, where does this leave EOL? Well, frankly, it should get out of the business of making web pages for taxa, because Wikipedia owns that space already. Their pages are fewer, but often much more detailed than the corresponding EOL page, and Wikipedia reacts faster to new discoveries. Wikipedia supports community editing, versioning, and quite sophisticated tools for handling bibliographic references.

There's plenty of scope for useful tools and services for EOL to develop, but I think the real game is elsewhere. Now, Wikipedia is far from perfect. It's basically semi-structured text with a God-awful template language, and it would benefit greatly from more structure (e.g., as could be provided by Semantic Mediawiki), but I think we should think about building upon it. We could build our own (and my experiments over at itaxon.org explore this), but the big challenge is getting a community around a project, and if David Shorthouse's pronouncement that The Community is Dead is correct, then maybe we should get on board with the community that already exists. Perhaps what EOL should be doing is talking to Wikipedia, improving the existing templates for taxon pages, and creating bots to automatically populate Wikipedia with more taxon pages.