iPhylo: iPhylo Demo

I started this blog with the goal of documenting my own efforts to make a database of evolutionary trees, based on ideas sketched in hdl:10.1038/npre.2007.1028.1. I've felt that the major task is link phylogenies to other information, such as taxon names, specimens, localities, images, publications, etc. That is, to embed trees in a broader context. Discovering how to engage with that broader context led to a bunch of experiments, toys, and diversions:

iSpecies, a toy to aggregate information on a species.
Semant, experiments with RDF and triple stores (AKA the Semantic Web).
bioGUID, an attempt to make identifiers resolvable, with an increasing focus on developing an OpenURL resolver for biodiversity literature.

iSpecies and bioGUID are still operational, but the ant work fell victim to server crashes, and a growing frustration with the limitations of triple stores. Blogs for all three projects document their histories: iSpecies, Semant, and bioGUID. In a sense, these blogs document the steps along the way to iPhylo.

Based on this experience, I started again with what I've previously referred to as a database of everything. The first public demo is online at iphylo.org/~rpage/demo1. It's very crude, but may give a sense of what I'm trying to do.

The goal of iPhylo is to treat biodiversity objects as equal citizens. Each object has a unique identifier, associated metadata, and is linked to other objects (for example, a specimen is linked to sequences, sequences are linked to publications, etc.). By following the links it is possible to generate new views on existing information, such as a map for study that doesn't have any maps. For example, below is a map generated for Brady et al. (doi:10.1073/pnas.0605858103), based on links between sequences and specimens (if you can't see the map you need a SVG-capable web browser, such as Safari 3 or Firefox 2).

Ironically there are no phylogenies yet. At this stage I'm trying to link the bits together.

How does it work?
More on this later. Briefly, iPhylo uses a entity-attribute-value model database to store objects and their relationships. Like bioGUID, iPhylo relies on a suite of web services (most external, some I've developed locally) to locate and resolve identifiers. iPhylo resolves identifiers for PubMed records, GenBank sequences, museum specimens, publications, etc. and adds the associated metadata to a local database. Wherever possible it resolves any links in the metadata (e.g., if a GenBank record mentions a specimen, iPhylo will try and retrieve information on that specimen). When you view an object in iPhylo, these links are displayed. iPhylo will also try and convert bibliographic records to identifiers (such as DOIs) if no identifiers are provided, and also extracts georeferences for specimens and sequences, either from original records or by using a georeferencing service. Taxonomic names are resolved using uBio, and are treated as "tags."

At present iPhylo is being populated by various scripts, there is no facility for users to add data. This is something I will add in the future.

Getting the data
One of the biggest challenges is getting data (or, to be more precise, figuring out how to harvest available data). iPhylo builds on code for bioGUID. I've also been exploring bulk harvesting data sources. Sometimes this is easy. Many sequences in Genbank are linked to records in PubMed, so if you know the Pubmed id for a paper, you can harvest the sequences. For example, even though the Bulletin of the American Museum of Natural History isn't indexed in PubMed, it is possible to retrieve all the sequences GenBank records as being from papers published in the Bulletin. You can retrieve this list in XML form by clicking here.

Why?

There are all sorts of things which could be done with this. For example, by linking objects together we can also track the provenance of data, and ultimately build "citation networks" of specimens, sequences, etc. For background see my paper on "Biodiversity informatics: the challenge of linking data and the role of shared identifiers" (doi:10.1093/bib/bbn022, preprint at hdl:10101/npre.2008.1760.1).

As mentioned above, we can generate maps even if the original study didn't include one (by following the links). Given that we can geotag studies, this opens up the possibility to query studies spatially. For example, this study on bats and this study on rodents deal with taxa with very similar distributions. A spatial query could find these easily. Imagine interested in, say, Madagascar, and being able to find phylogenetic studies , even if the title and abstract of the paper don't mention Madagascar by name.

There also potential to clean data. One of the first studies I uploaded is Grant et al.'s study of dart-poison frogs. The map for this study shows a outlying point in California:

This point is MVZ 199828, which is a specimen of the salamander Aneides flavipunctatus. In GenBank, MVZ 199828 is listed as the voucher for seven sequences from the frog Mannophryne trinitatis. Oops. A quick iSpecies search, and a click on the GBIF map reveals that there is a specimen MVZ 199838 of Mannophryne trinitatis. i suspect that this is the true voucher for these sequences, and the GenBank records contain a typo.

Future
This is all still very, very crude. The demo is slow, the queries aren't particularly clever, and I've probably gone overboard on using Javascript to populate the web pages. The real value isn't in the web pages, but the links between the data objects. This is my main focus -- extracting and adding links. For now the data is displayed but you can't edit it. However, this is coming. Very basic RDF and RSS feeds are available for each object, and fans of microformats will find some goodies, and sociologists of science may find some of the coauthorship graphs intriguing

iPhylo

Monday, May 05, 2008

iPhylo Demo

1 comment: