Thursday, October 04, 2007

processing PhylOData (pPOD)

The first pPOD workshop happened last month at NESCent, and some of the presentations are online on the pPOD wiki. Although I'm a "consultant" I couldn't be there, which is a pity because it looks to have been an interesting meeting. When pPOD was first announced I blogged some of my own thoughts on phylogenetics databases. The meeting had lots of interesting material on workflows and data integration, as well as outlining the problems faced by large-scale systematics. Some relevant links (partly here as a reminder to myself to explore these further):


David Shorthouse said...

The Orchestra developers could simplify their task if they took a look at RSSBus as a platform, which, though still based on Windoze, has now reached RC1 status. "You can automatically serve your feed in any of the supported formats (ATOM, RSS, HTML, JSON, XLS) by simply changing an input parameter or format the feed your way using a custom feed formatter."

Rod Page said...


I think it's a little more complicated than that. Orchestra is really addressing the issue of periodic updates to data that may have changed both locally and remotely, and that may exist in multiple copies.

For example, I think it is aimed at cases where, say, I have specimen information in a GenBank record as well as from a DiGIR provider, and I may also have edited that information locally (e.g., I may have georeferenced it). If and when the DiGIR provider or GenBank updates their records, how do I combine those updates with my copy of the data? If GenBank and the DiGIR provider conflict, which source do I treat as more reliable? I think this is what Orchestra is really about.

David Shorthouse said...

True, semi-automating decisions like that is a component of Orchestra, but consistently wrapping up and combining data resources (note: RSSBus does POST too, not just GET) is still a hurdle that has to be cleared.

Zack Ives said...

Hi, Rod and David,

Thanks for highlighting our work. I think our group agrees with essentially all of the points made above: indeed, the more formats available to us, the better. But as Rod says, we are focusing on a problem that's much more than finding or importing feeds.

The central problem we're trying to address is the following: how do we keep multiple changing databases "synchronized" when (1) they all use different data representations, and (2) they might modify the data they import (e.g., adding some sort of curation information) and want that curation to remain even when new data are imported? Additionally, we are seeking to make the task of data transformation easier (via query-like "schema mappings"), track where data originated (provenance), and automatically resolve conflicts if the site administrator has known policies.
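As a rough sketch of the kind of policy-driven reconciliation described above (all names here are hypothetical, nothing from Orchestra's actual API): locally curated fields survive an import, unchanged fields pass through, and remaining conflicts are settled by a site-supplied policy, with per-field provenance recorded along the way.

```python
# Hypothetical sketch of policy-driven record reconciliation;
# none of these names come from Orchestra itself.

def reconcile(local, remote, curated_fields, policy):
    """Merge a refreshed remote record into a local copy.

    local, remote  : dicts mapping field name -> value
    curated_fields : fields the local curator has deliberately changed;
                     these survive an import even if the remote value differs
    policy         : callable(field, local_value, remote_value) -> chosen value,
                     used for conflicting fields that were not curated
    """
    merged, provenance = {}, {}
    for field in set(local) | set(remote):
        lv, rv = local.get(field), remote.get(field)
        if field in curated_fields:
            merged[field], provenance[field] = lv, "local curation"
        elif lv == rv or rv is None:
            merged[field], provenance[field] = lv, "unchanged"
        elif lv is None:
            merged[field], provenance[field] = rv, "remote import"
        else:
            merged[field], provenance[field] = policy(field, lv, rv), "policy"
    return merged, provenance

# Example: the locality was georeferenced locally; the remote record
# later corrects the collector's name.
local  = {"locality": "Mt. Kinabalu (6.08, 116.55)", "collector": "Smith"}
remote = {"locality": "Mt. Kinabalu",                "collector": "J. Smith"}

merged, prov = reconcile(
    local, remote,
    curated_fields={"locality"},
    policy=lambda field, lv, rv: rv,  # naive policy: trust the remote source
)
```

Here the local georeference survives the import, while the uncurated collector field is resolved by the policy in favour of the remote value.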

We'd certainly be interested in hearing your perspective on what features are useful / necessary in this context. The features of our system have been pretty heavily influenced by the genomics / proteomics arena and phylogeny is certain to be a bit different.

Rod Page said...


Some examples of databases being out of sync:

GenBank sequences have information that is often out of date, such as "unpublished" references that have since appeared in print (see here for one example).

Specimen databases can easily get out of sync with the literature, as I've documented here.

There is also the issue of combining information from taxonomic databases. When I mapped names from TreeBASE to taxonomic databases (see TBMap and doi:10.1186/1471-2105-8-158) I uncovered some cases of mistakes in both TreeBASE and the source databases.

In each of these cases, the issue is what happens when I create a locally curated version of these resources that fixes the mistakes, and then go to update the externally derived information.

Schema mapping in most cases is relatively straightforward for the stuff that I'm interested in (publications, GenBank, specimens, taxa), but discovering conflicts and resolving them seem to be harder problems.
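To make that contrast concrete, here is a toy sketch (the field names and mapping are entirely made up, not GenBank's or DiGIR's real schemas): the mapping step is just a declarative rename table, while conflict discovery means actually diffing the mapped records field by field.

```python
# Toy illustration: schema mapping can be declarative and simple,
# while conflict discovery requires comparing values field by field.
# All field names here are invented for the example.

GENBANK_TO_LOCAL = {               # a simple one-to-one field mapping
    "organism": "taxon_name",
    "specimen_voucher": "specimen_code",
    "country": "locality",
}

def map_record(record, mapping):
    """Rename a record's fields into the local schema, dropping the rest."""
    return {mapping[k]: v for k, v in record.items() if k in mapping}

def find_conflicts(a, b):
    """Return the fields where two mapped records disagree."""
    return {f: (a[f], b[f]) for f in set(a) & set(b) if a[f] != b[f]}

genbank = map_record(
    {"organism": "Rana temporaria", "country": "France"},
    GENBANK_TO_LOCAL)
local = {"taxon_name": "Rana temporaria", "locality": "Spain"}

conflicts = find_conflicts(genbank, local)
```

The mapping is a few lines of configuration, but deciding what to do once `conflicts` is non-empty (which source to trust, whether a human needs to look) is where the hard part starts.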