
Wednesday, January 15, 2014

What I'll be working on in 2014: knowledge graphs and Google forests

More for my own benefit than anything else I've decided to list some of the things I plan to work on this year. If nothing else, it may make sobering reading this time next year.

A knowledge graph for biodiversity


Google's introduction of the "knowledge graph" gives us a happy phrase to use when talking about linking stuff together. It doesn't come with all the baggage of the "semantic web", or the ambiguity of "knowledge base". The diagram below is my mental model of the biodiversity knowledge graph (this comes from http://dx.doi.org/10.7717/peerj.190, but I sketched most of this for my Elsevier Challenge entry in 2008, see http://dx.doi.org/10.1038/npre.2008.2579.1).

[Figure 1: the biodiversity knowledge graph]

Parts of this knowledge graph are familiar: articles are published in journals, and have authors. Articles cite other articles (represented by a loop in the diagram below). The topology of this graph gives us citation counts (number of times an article has been cited), impact factor (citations for articles in a given journal), and author-based measures such as the H-index (a function of the distribution of citations for each article you have authored). Beyond simple metrics this graph also gives us the means to track the provenance of an idea (by following the citation trail).

[Figure: the publication subgraph (articles, journals, authors, and the article-cites-article loop)]
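
To make the "metrics from topology" point concrete, here is a minimal sketch (in Python, with invented articles and citation edges) of how a citation count and an H-index fall straight out of article-cites-article edges:

```python
from collections import Counter

# Hypothetical article-cites-article edges: (citing, cited)
edges = [
    ("A2", "A1"), ("A3", "A1"), ("A4", "A1"),
    ("A3", "A2"), ("A4", "A2"),
    ("A4", "A3"),
]

# Citation count = in-degree of each article node
citations = Counter(cited for citing, cited in edges)

def h_index(counts):
    """Largest h such that h articles each have at least h citations."""
    ranked = sorted(counts, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)

# Suppose some author wrote A1, A2 and A3
author_articles = ["A1", "A2", "A3"]
print(citations["A1"])                                   # 3 citations
print(h_index([citations[a] for a in author_articles]))  # h-index of 2
```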

The next step is to grow this graph to include the other things we care about (e.g., taxa, taxon names, specimens, sequences, phylogenies, localities, etc.).
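
Growing the graph really just means adding more kinds of nodes and edges. A toy sketch, with entirely hypothetical identifiers and edge labels (the real vocabulary is an open question):

```python
# Entirely hypothetical identifiers and edge labels
triples = [
    ("doi:10.xxxx/article-1", "cites",       "doi:10.xxxx/article-2"),
    ("name:Examplea nova",    "publishedIn", "doi:10.xxxx/article-1"),
    ("specimen:MUSEUM:12345", "typeOf",      "name:Examplea nova"),
    ("sequence:AB000001",     "voucheredBy", "specimen:MUSEUM:12345"),
    ("specimen:MUSEUM:12345", "collectedAt", "locality:somewhere"),
]

def neighbours(node):
    """Everything one edge away from a node, in either direction."""
    outgoing = [(p, o) for s, p, o in triples if s == node]
    incoming = [(p, s) for s, p, o in triples if o == node]
    return outgoing + incoming

# Starting from an article we can walk to names, specimens, sequences, localities, ...
print(neighbours("doi:10.xxxx/article-1"))
```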

BioNames


I spent a good deal of last year building BioNames (for background see my blog posts or read the paper in PeerJ http://dx.doi.org/10.7717/peerj.190). BioNames represents a small corner of the biodiversity knowledge graph, namely taxonomic names and their associated publications (with added chocolatey goodness of links to taxon concepts and phylogenies). In 2014 I'll continue to clean this data (I seem to be forever cleaning data). So far BioNames is restricted to animal names, but now that the plant folks have relaxed their previously restrictive licensing of plant data (see post on TAXACOM) I'm looking at adding the million or so plant names (once I've linked as many as possible to digital identifiers for the corresponding publications).

Spatial indexing


Now that I've become more involved in GBIF I'm spending more time thinking about spatial indexing, and our ability to find biodiversity data on a map. There's a great Google ad that appeared on UK TV late last year. In it, Julian Bayliss recounts the use of Google Earth to discover virgin rainforest (the "Google forest") on Mount Mabu in Mozambique.



It's a great story, but I keep looking at this and wondering "how did we know that we didn't know anything about Mount Mabu?" In other words, can we go to any part of the world and see what we know about that area? GBIF goes a little way there with its specimen distribution maps, which give some idea of what is now known from Mount Mabu (although the map layers used by GBIF are terrible compared to what Google offers).

[Figure: GBIF specimen distribution map for the Mount Mabu area]

But I want to be able to see all the specimens now known from this region (including the new species that have been discovered, e.g. see http://dx.doi.org/10.1007/s12225-011-9277-9 and http://dx.doi.org/10.1080/21564574.2010.516275). Why can't I have a list of publications relevant to this area (e.g., species descriptions, range extensions, ecological studies, conservation reports)? What about DNA sequences from material in this region (e.g., from organismal samples, DNA barcodes, metagenomics, etc.)? If GBIF is to truly be a "Global Biodiversity Information Facility" then I want it to be able to provide me with a lot more information than it currently does. The challenge is how to enable that to happen.
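
As a very rough sketch of the kind of query I want to be able to make, here is how one might ask GBIF for occurrences in a box around Mount Mabu, assuming the GBIF occurrence search web service accepts a WKT geometry parameter (the endpoint, parameters, and coordinates below are my approximations):

```python
import requests

# Approximate bounding box around Mount Mabu, as a WKT polygon
wkt = "POLYGON((36.3 -16.4, 36.5 -16.4, 36.5 -16.2, 36.3 -16.2, 36.3 -16.4))"

resp = requests.get(
    "https://api.gbif.org/v1/occurrence/search",   # assumed endpoint
    params={"geometry": wkt, "limit": 20},
)
data = resp.json()
print(data["count"])          # how many occurrences GBIF knows about in this box
for occ in data["results"]:
    print(occ.get("scientificName"), occ.get("year"))
```

Specimens are only the start, of course: the same "what do we know about this box?" query should eventually return publications, sequences, and the rest of the knowledge graph.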

Monday, May 10, 2010

Next steps for BioStor: citation matching

Thinking about next steps for my BioStor project, one thing I keep coming back to is the problem of how to dramatically scale up the task of finding taxonomic literature online. While I personally find it oddly therapeutic to spend a little time copying and pasting citations into BioStor's OpenURL resolver and trying to find these references in BHL, we need something a little more powerful.

One approach is to harvest as many bibliographies as possible, and extract citations. These citations can come from online bibliographies, as well as lists of literature cited extracted from published papers. By default, these would be treated as strings. If we can parse them to extract metadata (such as title, journal, author, year), that's great, but this is often unreliable. We'd then cluster strings into sets that are similar. If any one of these strings was associated with an identifier (such as a DOI), or if one of the strings in the cluster had been successfully parsed into its component metadata so we could find it using an OpenURL resolver, then we've identified the reference the strings correspond to. Of course, we can seed the clusters with "known" citation strings. For citations for which we have DOIs/handles/PMIDs/BHL/BioStor URIs, we generate some standard citation strings and add these to the set of strings to be clustered.

We could then provide a simple tool for users to find a reference online: paste in a citation string, the tool would find the cluster of strings the user's string most closely resembles, then return the identifier (if any) for that cluster (and, of course, we could make this a web service to automate processing entire bibliographies at a time).
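
A minimal sketch of both steps (clustering harvested strings, then looking up a user's string), with difflib's string similarity standing in for a proper citation similarity measure, and made-up function names and thresholds:

```python
import re
from difflib import SequenceMatcher

def normalise(s):
    """Lowercase, strip punctuation, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", s.lower())).strip()

def similarity(a, b):
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

clusters = []   # each cluster: {"members": [citation strings], "id": identifier or None}

def add_citation(citation, identifier=None, threshold=0.8):
    """Attach a citation string to the closest cluster, or start a new one."""
    best, best_score = None, 0.0
    for cluster in clusters:
        score = max(similarity(citation, m) for m in cluster["members"])
        if score > best_score:
            best, best_score = cluster, score
    if best is not None and best_score >= threshold:
        best["members"].append(citation)
        best["id"] = best["id"] or identifier
    else:
        clusters.append({"members": [citation], "id": identifier})

def match_citation(citation, threshold=0.8):
    """Return the identifier of the closest cluster, or None if nothing is close."""
    best_id, best_score = None, 0.0
    for cluster in clusters:
        score = max(similarity(citation, m) for m in cluster["members"])
        if score > best_score:
            best_id, best_score = cluster["id"], score
    return best_id if best_score >= threshold else None

# Seed the clusters with a "known" (invented) citation that already has an identifier
add_citation("Smith, J. 1901. A new frog. J. Zool. 12: 1-10.",
             identifier="biostor/12345")
# Look up a differently formatted string for (we hope) the same reference
print(match_citation("Smith J (1901) A new frog. Journal of Zoology 12: 1-10"))
# -> "biostor/12345" if the similarity clears the threshold, otherwise None
```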

I've been collecting some references on citation matching (bookmarked on Connotea using the tag "matching") related to this problem. One I'd like to highlight is "Efficient clustering of high-dimensional data sets with application to reference matching" (doi:10.1145/347090.347123, PDF here). The idea is that a large set of citation strings (or, indeed, any strings) can first be quickly clustered into subsets ("canopies"), within which we search more thoroughly:
[Figure: canopy clustering (overlapping canopies formed with a cheap metric, from doi:10.1145/347090.347123)]
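
A rough sketch of the canopy idea, using a cheap token-overlap distance of my own devising (not the metric from the paper) to form the canopies:

```python
import re

def tokens(s):
    """Bag of lowercase word tokens, punctuation stripped."""
    return set(re.sub(r"[^\w\s]", " ", s.lower()).split())

def cheap_distance(a, b):
    """Quick-and-dirty distance: fraction of tokens the two strings do NOT share."""
    ta, tb = tokens(a), tokens(b)
    return 1.0 - len(ta & tb) / max(len(ta | tb), 1)

def canopies(strings, loose=0.6, tight=0.3):
    """Group strings into (possibly overlapping) canopies using the cheap metric."""
    remaining = list(strings)
    result = []
    while remaining:
        centre = remaining.pop(0)
        canopy = [centre] + [s for s in remaining
                             if cheap_distance(centre, s) < loose]
        # strings very close to the centre need not seed or join further canopies
        remaining = [s for s in remaining if cheap_distance(centre, s) >= tight]
        result.append(canopy)
    return result

# The expensive comparison (full edit distance, metadata parsing, ...) then only
# needs to run within each canopy, not across every pair in the whole dataset.
```
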
When I get the chance I need to explore some clustering methods in more detail. One that appeals is the MCL algorithm, which I came across a while ago by reading PG Tips: developments at Postgenomic (where it is used to cluster blog posts about the same article). Much to do...

Tuesday, August 18, 2009

To wiki or not to wiki?

What follows are some random thoughts as I try and sort out what things I want to focus on in the coming days/weeks. If you don't want to see some wallowing and general procrastination, look away now.

I see four main strands in what I've been up to in the last year or so:
  1. services
  2. mashups
  3. wikis
  4. phyloinformatics
Let's take these in turn.

Services
Not glamorous, but necessary. This is basically bioGUID (see also hdl:10101/npre.2009.3079.1). bioGUID provides OpenURL services for resolving articles (it has nearly 84,000 articles in its cache), looking up journal names, resolving LSIDs, and RSS feeds.
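
For flavour, this is roughly what an OpenURL article lookup against a resolver like bioGUID's looks like; the endpoint URL, the made-up citation, and the response handling below are placeholders rather than bioGUID's documented interface:

```python
import requests

# Standard OpenURL-style keys; the real resolver URL and interface may differ
params = {
    "genre":  "article",
    "atitle": "A new species from somewhere interesting",   # article title (invented)
    "title":  "Zootaxa",                                     # journal title
    "volume": "1234",
    "spage":  "1",
    "date":   "2009",
}
resp = requests.get("http://bioguid.info/openurl", params=params)
print(resp.status_code, resp.text[:200])   # hopefully a record with a DOI or BHL link
```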

Mashups
iSpecies is my now aging tool for mashing up data from diverse sources, such as Wikipedia, NCBI, GBIF, Yahoo, and Google Scholar. I tweak it every so often (mainly to deal with Google Scholar forever mucking around with their HTML). The big limitation of iSpecies is that it doesn't make its results reusable (i.e., you can't write a script to call iSpecies and return data). However, it's still the place I go to quickly find out about a taxon.

The other mashups I've been playing with focus on taking standardised RSS feeds (provided by bioGUID, see above) and mashing them up, sometimes with a nice front end (e.g., my e-Biosphere 09 challenge entry).

Wiki
I've invested a huge amount of effort in learning how wikis (especially Mediawiki and its semantic extensions) work, documented in earlier posts. I created a wiki of taxonomic names as a sandbox to explore some of these ideas.

I've come to the conclusion that for basic taxonomic and biological information, the only sensible strategy for our community is to use (and contribute to) Wikipedia. I'm struggling to see any justification for continuing with a proliferation of taxonomic databases. After e-Biosphere 09 the game's up, people have started to notice that we've an excess of databases (see Claire Thomas in Science, "Biodiversity Databases Spread, Prompting Unification Call", doi:10.1126/science.324_1632).

Phyloinformatics
In truth I've not been doing much on this, apart from releasing tvwidget (code available from Google Code), and playing with a mapping of TreeBASE studies to bibliographic identifiers (available as a featured download from here). I've played with tvwidget in Mediawiki, and it seems to work quite well.

Where now?
So, where now? Here are some thoughts:
  1. I will continue to hack bioGUID (it's now consuming RSS feeds from journals, as well as Zotero). Everything I do pretty much depends on the services bioGUID provides.

  2. iSpecies really needs a big overhaul to serve data in a form that can be built upon. But this requires decisions on what that format should be, so this isn't likely to happen soon. But I think the future of mashup work is to use RDF and triple stores (providing that some degree of editing is possible). I think a tool linking together different data sources (along the lines of my ill-fated Elsevier Challenge entry) has enormous potential.

  3. I'm exploring Wikipedia and Wikispecies. I'm tempted to do a quantitative analysis of Wikipedia's classification. I think there needs to be some serious analysis of Wikipedia if people are going to use it as a major taxonomic resource.
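
  One way to start on that analysis would be to pull taxobox fields out of page wikitext via the MediaWiki API. A crude sketch (the regex is nowhere near a real template parser, and the field names are just the taxobox's Latin rank parameters):

```python
import re
import requests

def taxobox_fields(page_title):
    """Fetch a Wikipedia page's wikitext and crudely pull out template fields."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",
            "titles": page_title,
            "format": "json",
        },
    )
    page = next(iter(resp.json()["query"]["pages"].values()))
    wikitext = page["revisions"][0]["*"]
    # Grab "| key = value" lines; values keep their wiki markup ([[...]], {{...}})
    return dict(re.findall(r"\|\s*(\w+)\s*=\s*([^\n|]+)", wikitext))

fields = taxobox_fields("Aardvark")
print(fields.get("genus"), fields.get("familia"), fields.get("ordo"))
```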

  4. If I focus on Wikipedia (i.e., using an existing wiki rather than try to create my own), then that leaves me wondering what all the playing with iTaxon was for. Well, actually I think the original goal of this blog (way back in December 2005) is ideally suited to a wiki. Pretty much all the elements are in place to dump a copy of TreeBASE into a wiki and open up the editing of links to literature and taxonomic names. I think this is going to handily beat my previous efforts (TbMap, doi:10.1186/1471-2105-8-158), especially as errors will be easy to fix.

So, food for thought. Now, I just need to focus a little and get down to actually doing the work.