I'll be taking a break shortly, so I thought I'd try to gather some thoughts on a few projects/ideas that I'm working on. These are essentially extended notes to myself to jog my memory when I return to these topics.
BOLD data into GBIF
Following on from work on getting mosquito data into GBIF I've been looking at DNA barcoding data. BOLD data is mostly absent from GBIF. The publicly available data can be downloaded, and is in a form that could be easily ingested by GBIF. One problem is that the data is incomplete, and sometimes out of date. BOLD's data dumps and BOLD's API use different formats (sigh), and the API returns additional data such as image soft voucher specimens. Most data in the data dumps are not identified to species, so they will have limited utility for most GBIF users.One approach would be to take the data dumps as the basic data, then use the API to enhance that data, such as adding image links. If the API returns a species-level identification for a barcode then that could be added as an identification using the corresponding Darwin Core extension. In this way we could treat the data as an evolving entity, which it is as our knowledge of it improves. For a related example see Leafcutter bee new to science with specimen data on Canadensys where Canadensys record two different identifications of some bee specimens as research showed that some specimens represented a new species.
This work reflects my concern that GBIF is missing a lot of data outside its normal sources. The mechanism for getting data into GBIF is pretty bureaucratic and could do with reforming (or, at least provision of other ways to add data).
BOLD by itself
I've touched on this before (Notes on next steps for the million DNA barcodes map), I'd really like to do something better with the way we display and interact with DNA barcode data. This will need some thought on calculating and visualising massive phylogenies, and spatial queries that return subtrees. I can't help thinking that there's scope for some very cool things in this area. If nothing else, we can do interesting things without getting involved in some of the pain of taxonomic names.Big trees
Viewing big trees is still something of an obsession. I still think this hasn't been solved in a way that helps us learn about the tree and the entities in that tree. I currently think that a core problem to solve is how to cluster or "fold" a tree in a sensible way to highlight the major groups. I did something rather crude here, other approaches include "Constructing Overview + Detail Dendrogram-Matrix Views" (doi:10.1109/TVCG.2009.130, PDF here).Graph databases and the biodiversity knowledge graph
I'm working on a project to build a "biodiversity knowledge graph" (see doi:10.3897/rio.2.e8767). In some ways this is recreating (my entry in Elsevier's Grand Challenge "Knowledge Enhancement in the Life Sciences", see also hdl:10101/npre.2009.3173.1 and doi:10.1016/j.websem.2010.03.004).Currently I'm playing with Neo4J to build the graph from JSON documents stored in CouchDB. Learning Neo4J is taking a little time, especially as I'm creating nodes and edges on the fly and want to avoid creating more than one node for the same thing. In a world of multiple identifiers this gets tricky, but I think there's a reasonable way to do this (see the graph gist Handling multiple identifiers). Since I'm harvesting data I'm ending up building a web crawler, so I need to think about queues, and ways to ensure that data added at different times gets properly linked.
Wikipedia and wikidata
I occasionally play with Wikipedia and Wikidata, although this is often an exercise in frustration as editing Wikipedia tends to result in edit wars ("we don't do things that way"). Communities tend to be conservative. I'll write up some notes about ways Wikipedia and Wikidata can be useful, especially in the context of the Biodiversity Heritage Library (see also Possible project: mapping authors to Wikipedia entries using lists of published works).All the names
The database of all taxonomic names remains as elusive as ever -- our field should be deeply embarrassed by this, it's just ridiculous.My own efforts in this area involve (a) obtaining lists of names, by whatever means available, and (b) augmenting them to include links to the primary literature. I've made some this work publicly accessible (e.g., BioNames). I'd like all name databases to make their data open, but most are resistant to the idea (some aggressively so).
One approach to this is to simply ignore the whimpering and make the data available. Another is to consider recreating the data. We have name finding algorithms, and more of the literature is becoming available, either completely open (e.g., BHL) or accessible to mining (see Content Mine). At some point we will be able to recreate the taxonomic name databases from scratch, making the obstinate data providers not longer relevant.