Showing posts with label TreeBASE. Show all posts

Wednesday, May 11, 2022

Thoughts on TreeBASE dying(?)

So it looks like TreeBASE is in trouble, its legacy Java code a victim of security issues. Perhaps this is a chance to rethink TreeBASE, assuming that a repository of published phylogenies is still considered a worthwhile thing to have (and I think that question is open).

Here's what I think could be done.

  1. The data (individual studies with trees and data) are packaged into whatever format is easiest (NEXUS, XML, JSON) and uploaded to a repository such as Zenodo for long term storage. They get DOIs for citability. This becomes the default storage for TreeBASE.
  2. The data is transformed into JSON and indexed using Elasticsearch. A simple web interface is placed on top so that people can easily find trees (never a strong point of the original TreeBASE). Trees are displayed natively on the web using SVG. The number one goal is for people to be able to find trees, view them, and download them.
  3. To add data to TreeBASE, the easiest way would be for people to upload it directly to Zenodo and tag it "treebase". A bot then grabs a feed of these datasets and adds them to the search engine in (2) above. As time allows, add an interface where people upload data directly, it gets curated, then deposited in Zenodo. This presupposes that there are people available to do curation. Maybe have "stars" for the level of curation so that users know whether anyone has checked the data.
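The steps above could be wired together with very little code. Here is a sketch of the step 3 "bot", assuming Zenodo's public records API; both the keyword query syntax and the record field names are assumptions to verify against the live API:

```python
# Hypothetical sketch: poll Zenodo for records tagged "treebase" and
# flatten each one into a document for the search engine in step 2.
# Field names (id, doi, metadata.title, metadata.keywords) are
# assumed from Zenodo's REST API and may need adjusting.

def zenodo_query_url(tag="treebase", page=1):
    # Record search endpoint; the keyword query syntax is an assumption.
    return ("https://zenodo.org/api/records"
            f"?q=keywords:%22{tag}%22&page={page}&size=100")

def to_index_doc(record):
    """Flatten a Zenodo record into a minimal search-index document."""
    meta = record.get("metadata", {})
    return {
        "id": record.get("id"),
        "doi": record.get("doi") or meta.get("doi"),
        "title": meta.get("title"),
        "keywords": meta.get("keywords", []),
    }
```

Each document would then be PUT into Elasticsearch (or any document store) keyed by the Zenodo record id.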

There are lots of details to tweak, for example how many of the existing URLs for studies are preserved (some URL mapping will be needed), and what happens to the API? And I'm unclear about the relationship with Dryad.

My sense is that the TreeBASE code is very much of its time (10-15 years ago), a monolithic block of code with SQL, Java, etc. If one was starting from scratch today I don't think this would be the obvious solution. Things have trended towards being simpler, with lots of building blocks now available in the cloud. Need a search engine? Just spin up a container in the cloud and you have one. More and more functionality can be devolved elsewhere.

Another issue is how to support TreeBASE. It has essentially been a volunteer effort to date, with little or no funding. One reason I favour Zenodo as the storage engine is that it takes care of the long-term sustainability of the data.

I realise that this is all wild arm waving, but maybe now is the time to reinvent TreeBASE?

Updates

It's been a while since I've paid a lot of attention to phylogenetic databases, and it shows. There is a file-based storage system for phylogenies, phylesystem (see "Phylesystem: a git-based data store for community-curated phylogenetic estimates" https://doi.org/10.1093/bioinformatics/btv276), that is sort of what I had in mind, although long term persistence is based on GitHub rather than a repository such as Zenodo. Phylesystem uses a truly horrible-looking JSON transformation of NeXML (NeXML itself is ugly), and TreeBASE also supports NeXML, so some form of NeXML or a JSON transformation seems the obvious storage format. It will probably need some cleaning and simplification if it is to be indexed easily. Looking back over the long history of TreeBASE and phylogenetic databases I'm struck by how much complexity has been introduced over time. I think the tech has gotten in the way sometimes (which might just be another way of saying that I'm not smart enough to make sense of it all).

So we could imagine a search engine that covers both TreeBASE and Open Tree of Life studies.

Basic metadata-based searches would be straightforward, and we could have a user interface that highlights the trees (I think TreeBASE's biggest search rival is a Google image search). The harder problem is searching by tree structure, for which there is an interesting literature without any decent implementations that I'm aware of (as I said, I've been out of this field a while).

So my instinct is we could go a long way with simply indexing JSON (CouchDB or Elasticsearch), then need to think a bit more cleverly about higher taxon and tree based searching. I've always thought that one killer query would be not so much "show me all the trees for my taxon" but "show me a synthesis of the trees for my taxon". Imagine a supertree of recent studies that we could use as a summary of our current knowledge, or a visualisation that summarises where there are conflicts among the trees.
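A "higher taxon" query need not require anything clever at search time: if each tree's index document stores every ancestor of every leaf, a flat keyword match does the job. A minimal sketch, where the lineage lookup table stands in for a real taxonomy service:

```python
# Store each tree with not just its leaf taxa but every ancestor in
# their classification, so a query for "Anura" matches a tree of
# frog species. The lineage table here is a hypothetical stand-in.

def tree_index_doc(tree_id, leaves, lineage_of):
    """Build an index document whose 'taxa' field includes all
    ancestors of every leaf, enabling higher-taxon queries."""
    taxa = set()
    for leaf in leaves:
        taxa.add(leaf)
        taxa.update(lineage_of.get(leaf, []))
    return {"tree_id": tree_id, "leaves": leaves, "taxa": sorted(taxa)}
```

With documents of this shape, "show me all the trees for Anura" is just a term query on the `taxa` field in Elasticsearch or a view key in CouchDB.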

Relevant code and sites

Thursday, March 03, 2016

iSpecies meets TreeBASE

I'm continuing to play with the new version of iSpecies, seeing just how far one can get by simply grabbing JSON from various sources and mashing them up. Since the Open Tree of Life is pretty unresolved ("OMG it's full of stars") I've started to grab trees from TreeBASE and add those. Sadly TreeBASE is showing its age and doesn't have a JSON API, so I had to break my rule of only using HTML and JavaScript in iSpecies and write some PHP wrappers to talk to TreeBASE. Now, when you search for a genus or species you may see a list of studies from TreeBASE, and a popup menu where you can select a tree to view.

Below is an example (searching for the plant genus Fitzalania).

[Image: iSpecies search results showing TreeBASE trees]

This example shows one reason phylogenies are useful. Although GBIF (which supplies the data for the map) recognises Fitzalania, a recent study in TreeBASE shows that this renders Meiogyne paraphyletic, and so moves Fitzalania into Meiogyne. Hence GBIF's taxonomy is somewhat behind the current state of knowledge about these plants.

The paper merging these two genera (doi:10.1600/036364414x680825) also shows up in the CrossRef results. Unfortunately TreeBASE doesn't have the DOI for the paper, so linking these two results (the TreeBASE study and the corresponding paper) will require some work. This is another reason why I'm playing with iSpecies: I want to see how many identifiers we can uncover to connect results from different sources, and how many cross links we need to add before it all comes together in a nice linked graph of data.
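One obvious starting point for that linking work is matching on bibliographic metadata rather than identifiers. A crude sketch comparing normalised titles (real matching would also weigh authors, year, and journal):

```python
import re

# Match a TreeBASE study citation to a CrossRef record by title when
# no shared identifier (DOI) exists. Purely illustrative.

def normalise(title):
    """Lowercase and collapse punctuation/whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def same_paper(title_a, title_b):
    return normalise(title_a) == normalise(title_b)
```

In practice exact equality is too strict and a string-similarity threshold works better, but even this naive version catches many capitalisation and punctuation variants.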

Thursday, September 05, 2013

"Lost Branches on the Tree of Life" - why must the answer be enforcing behaviour?

Bryan Drew and colleagues have published a piece in PLoS Biology bemoaning the lack of databased phylogenies:

Drew, B. T., Gazis, R., Cabezas, P., Swithers, K. S., Deng, J., Rodriguez, R., Katz, L. A., et al. (2013). Lost Branches on the Tree of Life. PLoS Biology, 11(9), e1001636. doi:10.1371/journal.pbio.1001636 (see also blog post Dude, Where’s My Data?)

This is an old problem (see for example "Towards a Taxonomically Intelligent Phylogenetic Database" doi:10.1038/npre.2007.1028.1), but alas the solution proposed by Drew et al. is also old:

Optimally, all peer-reviewed journals that publish phylogenetic datasets should require deposition (and activation for public access) of alignments and trees prior to publication, and these trees and alignments will include the same characters and taxa (and taxon names) as in the published study.

In my opinion, as soon as you start demanding people do something you've lost the argument, and you're relying on power ("you don't get to publish with us unless you do 'x'"). This is also lazy. In a talk I gave to the NSF AVATOL meeting I argued that this is the wrong approach, when building shared resources carrots are better than sticks.


In that talk I used the example of Mendeley, which built an incredibly valuable resource (a bibliography of academic research in the cloud that they sold for US$100M) by providing a service that meets people's needs ("where's that damn PDF again?"). No brow-beating, no "you must do this", just clever social engineering.

So, my challenge to the phylogenetics community (and the authors of "Lost Branches on the Tree of Life" in particular) is to stop resorting to bullying people, and ask instead how you could make it a no brainer for people to share their trees. In other words, build something people actually need and will be inspired to contribute to.

Friday, October 19, 2012

The failure of phylogeny databases


It is well known that phylogeny databases such as TreeBASE capture a small fraction of the published phylogenies. This raises the question of how to increase the number of trees that get archived. One approach is compulsion:


In other words:
  1. Databasing trees is the Right Thing™ to do
  2. Few people are doing the Right Thing™
  3. This is because those people are bad/misguided and must be made to see the light

I want to suggest an alternative explanation:
  1. It is not at all obvious that databasing trees is useful
  2. The databases we have suck
  3. There's no obvious incentive for the people producing trees to database them
Why do we need a database of trees?

That we don't have a decent, widely used database of trees suggests that the argument still has to be made. Way back in the mid-1990s, when TreeBASE was first starting, I was at Oxford University and Paul Harvey (coauthor of The Comparative Method in Evolutionary Biology) was sceptical of its merits. Given that the comparative method depends on phylogenies, and people like Andy Purvis were in the Harvey lab building supertrees (http://dx.doi.org/10.1098/rstb.1995.0078), this may seem odd (it certainly did to me), but Paul shared the view of many systematists: phylogenies are labile, they change with increased data and taxon sampling, hence individual trees have a short life span.

Data, in contrast, is long-lived. You'd happily reuse GenBank sequences published a decade ago, you probably wouldn't use a decade-old phylogeny. I made this point in an earlier post about the data archive Dryad (Data matters but do data sets?). A problem facing packages of data (such as papers, data sets, and phylogenies) is that the package itself may be of limited interest, beyond reproducing earlier results and benchmarking. In the case of phylogenies, if someone has a tree ((a,b),c) and someone else has a tree ((d,e),f), it's not obvious that we can combine these. But if we have sequences for the same gene from the same six taxa we can build a larger tree, say (((a,d),(b,e)),(c,f)).
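The combinability point can be made concrete: two trees can only be merged through the taxa they share. A quick check of leaf overlap between two Newick strings (simple unquoted labels only, no branch lengths):

```python
import re

# Extract leaf labels from a Newick string and intersect them.
# If the intersection is empty, the trees cannot be combined,
# whereas sequences for the same gene always can be.

def newick_leaves(newick):
    """Return the set of leaf labels (simple alphanumeric labels only)."""
    return set(re.findall(r"[A-Za-z_]\w*", newick))

def shared_taxa(tree1, tree2):
    return newick_leaves(tree1) & newick_leaves(tree2)
```

For the example in the text, `shared_taxa("((a,b),c);", "((d,e),f);")` is empty, so there is nothing to anchor a combined analysis on.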

I think this is part of the reason why GenBank works. Yes, there is compulsion (it's very hard to publish on sequences if you haven't deposited the data in GenBank), but there are clear benefits of depositing data. As the database grows we can do bigger analyses. If you are trying to identify a species based on its DNA, the chances are that the nearest sequence will have been deposited by somebody else. By depositing data your work also lasts longer than if people just had the paper (your tree is likely to become outdated, but that sequence from a rare, hard-to-obtain species might be used for decades to come).

Note that I'm not saying a database of trees isn't a good idea, but there seems to be an assumption that it is so obvious that it doesn't need justification. Demonstrably this isn't the case. Maybe we should figure out what we'd want to do with such a database, then tackle how we'd make that possible. For example, I'd want to query a phylogeny database geographically (show me trees from this part of the globe), by ecological association (find the trees for any parasites on this clade), by temporal period (what clades originated in the Miocene?), by data (what trees used this sequence which we now know is chimeric?), by topology (have we settled on the sister group to snakes yet?), and so on. I would also argue that much of this is doable, but might not actually require archiving published phylogenies. Personally I think anybody tackling these questions would do well to use PhyLoTA as their starting point.
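Most of these queries reduce to filtering over tree metadata, provided that metadata is captured in the first place. A toy in-memory version over hypothetical records (a real system would push this into the search engine):

```python
# Illustrative only: the records and their fields (regions, epochs,
# taxa) are hypothetical metadata we would need to capture per tree.

def find_trees(trees, region=None, epoch=None, clade=None):
    """Return ids of trees matching all supplied criteria."""
    hits = []
    for t in trees:
        if region and region not in t.get("regions", []):
            continue
        if epoch and epoch not in t.get("epochs", []):
            continue
        if clade and clade not in t.get("taxa", []):
            continue
        hits.append(t["id"])
    return hits
```

The hard part is not the query but the curation: somebody has to attach regions, epochs, and taxonomic spans to each tree before questions like "what clades originated in the Miocene?" become answerable.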

TreeBASE sucks

Yes, I'm as sick of saying this as you are of reading it. But it doesn't change the fact that just about everything about TreeBASE, from the complexity of the underlying data model, the choice of programming language, and the use of a Java applet to display trees, to the Byzantine search interface and the voluminous XML output, makes it a bag of hurt. None of this would matter much if it were an indispensable part of people's research toolkit, but this isn't the case. If you are trying to convince people of the benefits of sharing trees you really want a tool that makes it seem a no-brainer. We aren't there yet.

The "fuck this" point

In a great post on the piracy threshold, Matt Gemmell argues that piracy is largely the fault of content providers because they make being honest too difficult. How many times have you wanted to buy something such as a book or a movie only to discover that the content provider doesn't sell it in your part of the world (e.g., in the iBooks store in the US but not the UK) or doesn't provide it in the media you want (e.g., DVD but not online)? To top it off, every time you go to the movies you are subjected to emotional blackmail and threats of unlimited fines should you copy the movie you have already paid to watch.


I think databases have the same "fuck this" threshold. If you are asking people to submit data you want to make it as easy as possible. And you want at least some of the benefits to be immediate and obvious. Otherwise you are left with coercing people, and that's being, at best, lazy.

If you want an example of how to do it right, look at Mendeley's model. They want to build a public cloud of academic papers, a laudable goal, the Right Thing™ to do. But they sell the idea not as a public good, not as the Right Thing™, nor by trying to compel people (they can't, they're a private company). Instead they address a major point of pain (where the hell did I put that PDF?) and make it trivial to organise your collection of articles. Then they make it possible to back the papers up to the cloud, to view them on multiple devices, to share them, and voilà, we get a huge database of publications. The sociology works. So, my question is: what would the equivalent be for phylogenetics?

Friday, July 20, 2012

Figshare and F1000 integrate data into publication: could TreeBASE do the same?

Quick thoughts on the recent announcement by figshare and F1000 about the new journals being launched on the F1000 Research site. The articles being published have data sets embedded as figshare widgets in the body of the text, instead of being, say, a static table. For example, the article:

Oliver, G. (2012). Considerations for clinical read alignment and mutational profiling using next-generation sequencing. F1000 Research. doi:10.3410/f1000research.1-2.v1
has a widget that looks like this:

[Image: embedded figshare data widget]
You can interact with this widget to view the data. Because the data are in figshare those data are independently citable, e.g. the dataset "Simulated Illumina BRCA1 reads in FASTQ format" has a DOI http://dx.doi.org/10.6084/m9.figshare.92338.

Now, wouldn't it be cool if TreeBASE did something similar? Imagine if uploading trees to TreeBASE were easy, and you didn't have to have published yet; you just wanted to store the trees and make them citable. Imagine if TreeBASE had a nice tree viewer (no, not a Java applet, a nice viewer that uses SVG, for example). Imagine if you could embed that tree viewer as a widget when you published your results. It's a win all round. People have an incentive to upload trees (a nice viewer, a place to store them, and others can cite the trees because they'd have DOIs). TreeBASE builds its database a lot more quickly (by making it dead easy to upload trees), and as more publishers adopt this style of publishing, TreeBASE is well placed to provide nice visualisations of phylogenies: pre-packaged, interactive, and citable. And let's not stop there: how about a nice alignment viewer? Perhaps this is something the currently rather moribund PLoS Currents Tree of Life could think about supporting?

Saturday, June 02, 2012

Linking NCBI taxonomy to GBIF


In response to Rutger Vos's question I've started to add GBIF taxon ids to the iPhylo Linkout website. If you've not come across iPhylo Linkout, it's a Semantic Mediawiki-based site where I maintain links between the NCBI taxonomy and other resources, such as Wikipedia and the BBC Nature Wildlife Finder. For more background see:

Page, R. D. M. (2011). Linking NCBI to Wikipedia: a wiki-based approach. PLoS Currents, 3, RRN1228. doi:10.1371/currents.RRN1228

I'm now starting to add GBIF ids to this site. This is potentially fraught with difficulties. There's no guarantee that the GBIF taxonomy ids are stable, unlike NCBI tax_ids which are fairly persistent (NCBI publish deletion/merge lists when they make changes). Then there are the obvious problems with the GBIF taxonomy itself. But, if you want a way to generate a distribution map for a taxon in the NCBI taxonomy, the quickest way is going to be via GBIF.

The mapping is being made automatically, with some crude checks to try and avoid too many erroneous links (e.g., due to homonyms). It will probably take a few days to complete (the mapping is quick, uploading to the wiki is a bit slower). Using a wiki to manage the mapping makes it easy to correct any spurious matches.

As an example, the page http://iphylo.org/linkout/Ncbi:109175 is for the frog Hyla japonica (NCBI tax_id 109175) and shows links to Wikipedia (http://en.wikipedia.org/wiki/Japanese_Tree_Frog) and to GBIF (http://data.gbif.org/species/2427601/). There's even a link to TreeBASE. I display a GBIF map so you can see what data GBIF currently has for that taxon.

[Image: GBIF map for Hyla japonica]

So, we have a wiki page, how do we answer Rutger's original question: how to get GBIF occurrence records via web service?

To do this we can use the RDF output by the Semantic Mediawiki software that underpins the wiki. You can get this by clicking on the RDF icon near the bottom of the page, or by going to http://iphylo.org/linkout/Special:ExportRDF/Ncbi:109175. The RDF this produces is really, really ugly (and people wonder why the Semantic Web has been slow to take off...). In this RDF you will see the statement:

<rdfs:seeAlso rdf:resource="http://data.gbif.org/species/2427601/"/>

So, arm yourself with XPath, a regular expression, or if you are a serious RDF geek break out the SPARQL, and you can extract the GBIF taxon id for a NCBI taxon. Given that id you can query the GBIF web services. One service that I like is the occurrence density service, which you can use to recreate the 1°×1° density maps shown by GBIF. For example, http://data.gbif.org/ws/rest/density/list?taxonconceptkey=2427601 will get you the squares shown in the screen shot above.
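For the regular expression route, here is a Python sketch that pulls the GBIF taxon id out of the exported RDF and builds the density service URL (URL patterns as given in this post):

```python
import re

# Extract the GBIF taxon id from the rdfs:seeAlso statement in the
# wiki's exported RDF, then build the occurrence density query.

SEE_ALSO = re.compile(
    r'rdfs:seeAlso rdf:resource="http://data\.gbif\.org/species/(\d+)/?"')

def gbif_taxon_id(rdf_text):
    """Return the GBIF taxon id found in the RDF, or None."""
    m = SEE_ALSO.search(rdf_text)
    return m.group(1) if m else None

def density_url(taxon_id):
    """URL for GBIF's 1-degree occurrence density service."""
    return ("http://data.gbif.org/ws/rest/density/list"
            f"?taxonconceptkey={taxon_id}")
```

The same extraction is a one-liner in XPath over the RDF/XML, but the regex version needs no XML parsing at all.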

Of course, I have glossed over several issues, such as the errors and redundancy in the GBIF classification, the mismatch between NCBI and GBIF classifications (NCBI has many more ranks than GBIF), and whether the taxon concepts used by the two databases are equivalent (this is likely to be more of an issue for higher taxa). But it's a start.

Tuesday, February 21, 2012

Linking GBIF and Genbank

As part of my mantra that it's not about the data, it's all about the links between the data, I've started exploring matching GenBank sequences to GBIF occurrences using the specimen_voucher codes recorded in GenBank sequences. It's quickly becoming apparent that this is not going to be easy. Specimen codes are not unique and are written in all sorts of ways, and there are multiple codes for the same specimen (GenBank sequences may be associated with museum catalogue entries, or with field or collector numbers).
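A first pass at taming the voucher codes is to normalise them to a canonical "CODE NUMBER" form, so that variants like "USNM 514547", "USNM514547" and "usnm-514547" compare equal. A sketch (real vouchers are messier, with ranges, suffixes, and multiple collections, so this is only a starting point):

```python
import re

# Normalise a simple "institution code + catalogue number" voucher
# string. Returns None for anything that doesn't fit that pattern.

def normalise_voucher(code):
    m = re.match(r"\s*([A-Za-z]+)[\s:\-_]*0*(\d+)\s*$", code)
    if not m:
        return None
    return f"{m.group(1).upper()} {int(m.group(2))}"
```

Matching normalised codes still leaves homonymous institution acronyms and re-catalogued specimens to deal with, but it collapses the most common spelling variants.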

So why undertake what is fast looking like a hopeless task? There are several reasons:
  1. GBIF occurrences have a unique URL which we could potentially use as a unique, resolvable identifier for the corresponding specimen.
  2. Linking GenBank to GBIF would make it possible for GBIF to list sequences associated with a specimen, as well as the associated publication, which means we could demonstrate the "impact" of a specimen. In the simplest terms this could be the number of sequences and publications that use data from the specimen, more sophisticated approaches could use PageRank-like measures, see hdl:10101/npre.2008.1760.1.
  3. Having a unique identifier that is shared across different databases makes it easier to combine data from different sources. For example, if a sequence in GenBank lacks geographic coordinates but the voucher specimen in GBIF is georeferenced, we can use that information to locate the sequence in geographic space (and hence build geophylogenies or add spatial indexes to databases such as TreeBASE). Conversely, if the GenBank sequence is georeferenced but the GBIF record isn't, we can update the GBIF record and possibly expand the range of the corresponding taxon (this was part of the motivation behind hdl:10101/npre.2009.3173.1).

As an example, below is the GBIF 1° density map for the frog Pristimantis ridens, with the phylogeny from Wang et al., "Phylogeography of the Pygmy Rain Frog (Pristimantis ridens) across the lowland wet forests of isthmian Central America" (http://dx.doi.org/10.1016/j.ympev.2008.02.021), layered over it. I created the KML tree from the corresponding tree in TreeBASE using the tool I described earlier. You can grab the KML for the tree here.

[Image: GBIF density map with the KML phylogeny overlaid]

As we'd expect, there is a lot of overlap in the two sources of data. If we investigate further, there are records that are in fact based on the same specimen. For example, if we download the GBIF KML file with individual placemarks we see that in the northern part of the range there are 15 GBIF occurrences that map onto the same point as one of the terminal taxa in the tree.

[Image: GBIF placemarks coinciding with a terminal taxon in the tree]

One of these 15 GBIF records (http://data.gbif.org/occurrences/244335848) is for specimen USNM 514547, which is the voucher specimen for EU443175. This gives us a link between the record in GBIF and the record in GenBank. It also gives us a URI we can use for the specimen http://data.gbif.org/occurrences/244335848 instead of the unresolvable and potentially ambiguous USNM 514547.

If we view the geophylogeny from a different vantage point we see numerous localities that don't have occurrences in GBIF.

[Image: geophylogeny localities without corresponding GBIF occurrences]

Close inspection reveals that some of the specimens listed in the Wang et al. paper are actually in GBIF, but lack geographic coordinates. For example the OTU "Pristimantis ridens Nusagandi AJC 0211" has the voucher specimen FMNH 257697. This specimen is in GBIF as http://data.gbif.org/occurrences/57919777/, but without coordinates, so it doesn't appear on the GBIF map. However, both the Wang et al. paper and the GenBank record for the sequence from this specimen EU443164 give the latitude and longitude. In this example, GBIF gives us a unique identifier for the specimen, and GenBank provides data on location that GBIF lacks.

Part of GBIF's success is due to the relative ease of integrating data by taxonomic names (despite the problems caused by synonyms, homonyms, misspellings, etc.) or by spatial coordinates (which immediately enable integration with environmental data). But if we want to integrate at deeper levels then specimen records are the glue that connects GBIF (and its contributing data sources) to sequence databases, phylogenies, and the taxonomic literature (via lists of material examined). This will not be easy, certainly for legacy data that cites ambiguous specimen codes, but I would argue that the potential rewards are great.

Thursday, February 02, 2012

Browsing TreeBASE using a genome browser-like interface

One of the things I find frustrating about TreeBASE is that there's no easy way to get an overview of what it contains. What is its taxonomic coverage like? Is it dominated by plants and fungi, or are there lots of animal trees as well? Are there obvious gaps in our phylogenetic knowledge, or do the phylogenies it contains pretty much span the tree of life?

As part of my phyloinformatics course I've put together a simple browser to navigate through TreeBASE. The inspiration comes from genome browsers (e.g., the UCSC Genome Browser) where the genome is treated as a linear set of co-ordinates, and features of the genome are displayed as "tracks".

[Image: genome browser screenshot]

For my browser, I've used the order in which nodes appear in the NCBI tree as you go from left to right as the set of co-ordinates (actually, from top to bottom as my browser displays the co-ordinate axis vertically).

[Image: the TreeBASE browser interface]

I then place each TreeBASE tree within this classification by taking the TreeBASE → NCBI mapping provided by TreeBASE and finding the "majority rule" taxon for each tree (in a sense, the taxon that summarises what the tree is about). Each tree is represented by a vertical line depicting the span of the corresponding NCBI taxon (corresponding to a "track" in a genome browser). Taking the majority-rule taxon rather than, say, the span of the tree makes it possible to pack the vertical lines tightly together so that they take up less space (the ordering from left to right is determined by the NCBI taxonomy).
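The "majority rule" placement can be sketched as follows: take the NCBI lineage (root to tip) for each leaf in a tree, then pick the deepest taxon that contains more than half of the leaves. A minimal implementation, with lineages supplied by a hypothetical lookup:

```python
from collections import Counter

# For each leaf we have its root-to-tip NCBI lineage. The majority
# taxon is the deepest taxon covering more than half the leaves.

def majority_taxon(lineages):
    """lineages: list of root->tip lists of taxon names, one per leaf."""
    n = len(lineages)
    counts = Counter()
    depth = {}
    for lineage in lineages:
        for d, taxon in enumerate(lineage):
            counts[taxon] += 1
            depth[taxon] = max(depth.get(taxon, 0), d)
    majority = [t for t, c in counts.items() if c > n / 2]
    return max(majority, key=lambda t: depth[t]) if majority else None
```

So a tree with two chordate leaves and one arthropod leaf gets filed under Chordata rather than the shallower (but universal) Metazoa.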

If you mouse-over a vertical bar you can see the title of the study that published the tree. If you click on the vertical bar you'll see the tree displayed on the right (if your web browser understands SVG, that is). If you click on the background you will drill down a level in the NCBI classification. To go back up the classification, click on the arrow at the top left of the browser.

This is all very preliminary, but you can take it for a spin at http://iphylo.org/~rpage/phyloinformatics/treebase/.

Below is a short video walking you through some examples.

Thursday, March 24, 2011

TreeBASE meets NCBI, again

Déjà vu is a scary thing. Four years ago I released a mapping between names in TreeBASE and other databases called TBMap (described here: doi:10.1186/1471-2105-8-158). Today I find myself releasing yet another mapping, as part of my NCBI to Wikipedia project. By embedding the mapping in a wiki it can be edited, so the kinds of problems I encountered with TBMap (recounted here, here, and here) can be fixed. The mapping in and of itself isn't terribly exciting, but it's the starting point for some things I want to do regarding how to visualise the data in TreeBASE.

Because TreeBASE 2 has issued new identifiers for its taxa (see TreeBASE II makes me pull my hair out), and now contains its own mapping to the NCBI taxonomy, as a first pass I've taken their mapping and added it to http://iphylo.org/linkout. I've also added some obvious mappings that TreeBASE has missed. There are a lot more taxa which could be added, but this is a start.

The TreeBASE taxa that have a mapping each get their own page with a URL of the form http://iphylo.org/linkout/<TreeBase taxon identifier>, e.g. http://iphylo.org/linkout/TB2:Tl257333. This page simply gives the name of the taxon in TreeBASE and the corresponding NCBI taxon id. It uses a Semantic Mediawiki template to generate a statement that the TreeBASE and NCBI taxa are a "close match". If you go to the corresponding page in the wiki for the NCBI taxon (e.g., http://iphylo.org/linkout/Ncbi:448631) you will see any corresponding TreeBASE taxa listed there. If a mapping is erroneous, we simply need to edit the TreeBASE taxon page in the wiki to fix it. Nice and simple.

At the time of writing the initial mapping is still being loaded (this can take a while). I'll update this post when the uploading has finished.

Wednesday, December 15, 2010

TreeBASE, again

My views on TreeBASE are pretty well known. Lately I've been thinking a lot about how to "fix" TreeBASE, or indeed, move beyond it. I've made a couple of baby steps in this direction.

The first step is that I've created a group for TreeBASE papers on Mendeley. I've uploaded all the studies in TreeBASE as of December 13 (2010). Having these in Mendeley makes it easier to tidy up the bibliographic metadata, add missing identifiers (such as DOIs and PubMed ids), and correct citations to non-existent papers (which can occur if, at the time the authors uploaded their data, they planned to submit their paper to one journal, but it ended up being accepted in another). If you've a Mendeley account, feel free to join the group. If you've contributed to TreeBASE, you should find your papers already there.

The second step is playing with CouchDB (this year's new hotness), exploring ways to build a database of phylogenies that has nothing much to do with either a relational database or a triple store. CouchDB is a document store, and I'm playing with taking NeXML files from TreeBASE, converting them to something vaguely usable (i.e., JSON), and adding them to CouchDB. For fun, I'm using my NCBI to Wikipedia mapping to get images for taxa, so if TreeBASE has mapped a taxon to the NCBI taxonomy, and that taxon has a page in Wikipedia with an image, we get an image for that taxon. The reason for this is I'd really like a phylogeny database that was visually interesting. To give you some examples, here are trees from TreeBASE (displayed using SVG), together with thumbnails of images from Wikipedia:

[Images: four example TreeBASE trees rendered in SVG, each with Wikipedia image thumbnails]


Everything (tree and images) is stored within a single document in CouchDB, making the display pretty trivial to construct. Obviously this isn't a proper interface, and there's things I'd need to do, such as order the images in such a way that they matched the placement of the taxa on the tree, but at a glance you can see what the tree is about. We could then envisage making the images clickable so you could find out more about that taxon (e.g., text from Wikipedia, lists of other trees in the database, etc.).
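A single self-contained document of this sort might be assembled like so (the field names are mine, not TreeBASE's; only `_id` means anything to CouchDB):

```python
# Bundle everything needed to render one study page: the tree, the
# study id, and an image URL per taxon from the NCBI->Wikipedia
# mapping. One document, one fetch, trivial display.

def make_tree_doc(study_id, tree_newick, taxon_images):
    return {
        "_id": f"treebase-{study_id}",   # CouchDB document id
        "study": study_id,
        "tree": tree_newick,
        "taxa": [
            {"name": name, "image": url}
            for name, url in sorted(taxon_images.items())
        ],
    }
```

Because the images live inside the document, rendering needs no joins and no extra queries, which is exactly the trade-off a document store encourages.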

We could expand this further by extracting geographical information (say, from the sequences included in the study) and making a map, or eventually a phylogeny on Google Earth (see David Kidd's recent "Geophylogenies and the Map of Life" for a manifesto, doi:10.1093/sysbio/syq043).

One of the big things missing from databases like TreeBASE is a sense of "fun", or serendipity. It's hard to find stuff, hard to discover new things, make new connections, or put things in context. And that's tragic. Try a Google image search for treebase+phylogeny:

[Image: Google image search results for treebase+phylogeny]

Call me crazy, but I looked at that and thought "Wow! This phylogeny stuff is cool!" Wouldn't it be great if that's the reaction people had when they looked at a database of evolutionary trees?

Thursday, July 08, 2010

Show me the trees! Playing with the TreeBASE API

Being in an unusually constructive mood, I've spent the last couple of days playing with the TreeBASE II API, in an effort to find out how hard it would be to replace TreeBASE's frankly ghastly interface.

After some hair pulling and bad language I've got something to work. It's very crude, but gives a glimpse at what can be done. If you visit http://iphylo.org/~rpage/mytreebase/ and enter a taxon name, my code paddles off and queries TreeBASE to see if it has any phylogenies for that taxon. Gears grind, RSS feeds are crunched, a triple store is populated, NEXUS files are grabbed and Newick trees extracted, small creatures are needlessly harmed, and at last some phylogeny thumbnails are rendered in SVG (based on code I mentioned earlier), grouped by study. Functionality is limited (you can't click on the trees to make them bigger, for example), and the bibliographic information TreeBASE stores for studies is a bit ropey, but you get the idea.

[Image: phylogeny thumbnails grouped by study]

What I'm looking for at this stage is a very simple interface that answers the question "show me the trees", which I think is the most basic question you can ask of TreeBASE (and one its own web interface makes unnecessarily hard). I've also gained some inspiration from the BioText search engine.

If you want to give it a try, here are some examples. These examples should be fairly responsive as the data is cached, but if you try searching for other taxa you may have a bit of a wait while my code talks to TreeBASE.



Wednesday, June 02, 2010

TreeBASE II RDF

One of the potentially powerful features of TreeBASE II is the availability of an RDF version of a study. This means that, in principle, one could take the RDF for a TreeBASE study, combine it with RDF from other sources, and generate a richer view of a particular study. For example, if a TreeBASE study has a DOI, then we could link it to bibliographic details for the study, and through them to other information, such as GenBank sequences, specimens, etc. (see my little linked data browser for an example of some of this linking). If we added a phylogeny viewer, then we'd have a great tool for browsing the basic components of a phylogenetic study.

Unfortunately, we're not there yet. I've been trying to make sense of TreeBASE II RDF, and frankly, it's a mess. Here are some of the problems:

TreeBASE URIs aren't linked data compliant
The canonical URI for a study (e.g., http://purl.org/phylo/treebase/phylows/study/TB2:S10423) doesn't conform to the linked data approach. In fact, the URI crashes the linked data validator, so I tried another test.


curl --include \
     --header "Accept: application/rdf+xml" \
     http://purl.org/phylo/treebase/phylows/study/TB2:S10423

To be a valid linked data resource this request should return a 303 HTTP status code. Instead we get a 302 and some HTML. Linked data clients won't be able to extract information from this URI.
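The same first-hop check can be scripted (a minimal sketch using only the standard library; it disables redirect-following so the 302/303 itself is visible rather than the page it points at):

```python
import urllib.error
import urllib.request

def is_linked_data_status(status):
    """Only 303 See Other is the compliant answer for a request for a
    non-information resource; TreeBASE's 302 fails this test."""
    return status == 303

def first_hop_status(uri):
    """Request uri with an RDF Accept header and report the status of
    the first response, without following any redirect."""
    class NoRedirect(urllib.request.HTTPRedirectHandler):
        def redirect_request(self, *args, **kwargs):
            return None  # refuse to follow, so the redirect surfaces

    opener = urllib.request.build_opener(NoRedirect)
    req = urllib.request.Request(
        uri, headers={"Accept": "application/rdf+xml"})
    try:
        with opener.open(req) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the 302/303 arrives here once redirects are off
```

Running `first_hop_status` against the study URI above should, per the curl output, return 302 rather than the 303 a linked data client expects.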

SKOS matching
There are some odd things going on in the RDF. It contains statements of the form:

<rdf:Description rdf:ID="otu1789319">
<skos:closeMatch rdf:resource="http://purl.uniprot.org/taxonomy/76066.rdf"/>
</rdf:Description>

(I've tidied this up a little from the original, rather verbose RDF). This asserts that the TreeBASE OTU otu1789319 corresponds to the NCBI taxon with the taxonomy id 76066 (represented by the Uniprot URI). Except, it doesn't really. As far as I understand it, SKOS is about matching concepts, not documents. The URI http://purl.uniprot.org/taxonomy/76066.rdf is a document URI (specifically, an RDF document), the URI http://purl.uniprot.org/taxonomy/76066 is the taxon. The match should really be to http://purl.uniprot.org/taxonomy/76066. Then I've come across statements that match TreeBASE OTUs to http://purl.uniprot.org/taxonomy/0.rdf. This URI doesn't exist (we get a 404). This seems an odd way to say that we don't have a match -- if we don't have a match, don't include it in the RDF.
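The clean-up rule just described is mechanical: strip the ".rdf" document suffix to recover the concept URI, and drop the taxon-0 placeholders entirely. A sketch:

```python
def uniprot_concept_uri(resource):
    """Turn a Uniprot taxonomy *document* URI into the *concept* URI,
    and discard the bogus taxon-0 "no match" placeholder.

    Returns None when the skos:closeMatch should simply be omitted.
    """
    prefix = "http://purl.uniprot.org/taxonomy/"
    if not resource.startswith(prefix):
        return resource  # not a Uniprot taxonomy URI; leave untouched
    taxon = resource[len(prefix):]
    if taxon.endswith(".rdf"):
        taxon = taxon[:-len(".rdf")]  # document URI -> concept URI
    if taxon == "0":
        return None  # no match: don't emit a closeMatch at all
    return prefix + taxon
```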

Local URIs for trees don't work
The RDF is full of local URIs such as http://purl.org/phylo/treebase/phylows/#tree1790755, which don't resolve. In fact they generate a rather spectacular Tomcat exception. I don't understand why we need local URIs. Everything in TreeBASE should have a global URI. Then we can avoid unnecessary statements such as:

http://purl.org/phylo/treebase/phylows/#tree1790755 owl:sameAs http://purl.org/phylo/treebase/phylows/tree/TB2:Tr7899

which links a local resource to a global one http://purl.org/phylo/treebase/phylows/tree/TB2:Tr7899. Incidentally, this URI doesn't resolve, despite claims that this bug has been fixed.

No links between tree and study
But the show stopper for me is that there is no link between a study and a tree! There is no triple in the RDF specifying any relationship between these two entities. To me this is just about the most important thing I need. I want to be able to query TreeBASE RDF using a study identifier (either from TreeBASE itself, or from an external identifier such as a DOI or a PubMed number). As it stands the TreeBASE II RDF is almost useless. I can't get it via a linked data client, it's full of URIs that don't resolve, and it lacks key triples that would glue things together.

RDF != XML

I can't help thinking that the RDF output hasn't been designed with end use in mind. I know from my own experience that it's not until you try to do something with the RDF that you realise how poor some design decisions may have been.

It's not enough to pump out RDF and hope for the best. RDF is not XML, which is just a verbose format for moving data around. RDF brings with it all sorts of expectations about how clients will resolve it, how they will interpret URIs, and the kinds of queries that will be performed. We are achingly close to being able to tie everything together, but not with the RDF TreeBASE II is currently making available.

Tuesday, May 25, 2010

TreeBASE II makes me pull my hair out

I've been playing a little with TreeBASE II, and the more I do the more I want to pull my hair out.

Broken URLs
The old TreeBASE had a URL API, which databases such as NCBI made use of. For example, the NCBI page for Amphibolurus nobbi has a link to this taxon in TreeBASE. The link is http://www.treebase.org/cgi-bin/treebase.pl?TaxonID=T31183&Submit=Taxon+ID. Now, this is a fragile looking link to a Perl CGI script, and sure enough, it's broken. Click on it and you get a 404. In moving to the new TreeBASE II, all these inward links have been severed. At a stroke TreeBASE has cut itself off from an obvious source of traffic from probably the most important database in biology. Please, please, throw in some mod_rewrite and redirect these CGI calls to TreeBASE II.
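The redirect being asked for could be a handful of lines of Apache configuration. This is purely hypothetical: the target path is an assumption, since TreeBASE II would first need an endpoint that accepts legacy TaxonIDs at all.

```apache
# Hypothetical rules for the old TreeBASE vhost: forward legacy CGI
# links to TreeBASE II instead of returning 404.
# e.g. /cgi-bin/treebase.pl?TaxonID=T31183&Submit=Taxon+ID
RewriteEngine On
RewriteCond %{QUERY_STRING} (?:^|&)TaxonID=(T\d+)
RewriteRule ^/cgi-bin/treebase\.pl$ /treebase-web/search/legacyTaxon/%1? [R=301,L]
```

The trailing `?` drops the old query string, and `R=301` tells NCBI (and everyone else) that the move is permanent.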

New identifiers
All the TreeBASE studies and taxa have new identifiers. Why? Imagine if GenBank decided to trash all the accession numbers and start again from scratch. TreeBASE II does support "legacy" StudyIDs, so you can find a study using the old identifier (you know, the one people have cited in their papers). But there's no support for legacy TaxonIDs (such as T31183 for Amphibolurus nobbi). I have to search by taxon name. Why no support for legacy taxon IDs?

Dumb search
Which brings me to search. The search interface for taxa in TreeBASE is gloriously awful:

tbsearch.png

So, I have to tell the computer what I'm looking for. I have to tell it whether I'm looking for an identifier or doing a text search, then within those categories I need to be more specific: do I want a TreeBASE taxon ID (new ones of course, because the old ones have gone), NCBI id, or uBio? And this is just the "simple" search, because there's an option for "Advanced search" below.

Maybe it's just me, but I get really annoyed when I'm asked to do something that a computer can figure out. I shouldn't have to tell a computer that I'm searching for a number or some text, nor should I tell it what that number or text means. Computers are pretty good at figuring that stuff out. I want one search box, into which I can type "Amphibolurus nobbi", or "Tx1294" or "T31183" or "206552" or "6457215" or "urn:lsid:ubio.org:namebank:6457215" (or a DOI, or a text string, or pretty much anything) and the computer does the rest. I don't ever want to see this:

tbsearch2.png

Computers are dumb, but they're not so dumb that they can't figure out if something is a number or not. What I want is something close to this:

google.png

Is this really too much to ask? Can we have a search interface that figures out what the user is searching for?

Note to self: Given that TreeBASE has an API, I wonder how hard it would be to knock up a tool that took a search query, ran some regular expressions to figure out what the user might be interested in, then hit the API with that search, and returned the results?
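That note-to-self tool is mostly a handful of regular expressions. A sketch of the dispatcher, using the example queries above (the category names are mine, not part of any TreeBASE API):

```python
import re

# Ordered guesses about what a user pasted into a single search box.
PATTERNS = [
    ("ubio_lsid",      re.compile(r"^urn:lsid:ubio\.org:namebank:\d+$")),
    ("doi",            re.compile(r"^(?:doi:)?10\.\d{4,9}/\S+$")),
    ("treebase_taxon", re.compile(r"^Tx\d+$")),
    ("legacy_taxon",   re.compile(r"^T\d+$")),
    ("numeric_id",     re.compile(r"^\d+$")),  # NCBI or uBio: try both
]

def classify_query(query):
    """Guess the identifier type, falling back to a plain text search."""
    query = query.strip()
    for name, pattern in PATTERNS:
        if pattern.match(query):
            return name
    return "taxon_name"
```

A bare number like "206552" is ambiguous between NCBI and uBio, so the front end would simply fire both searches and merge the results -- which is exactly the sort of work the user shouldn't have to do.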

My concern here is that TreeBASE II is important, very important. Which means it's important to make it usable, which means don't break existing URLs, don't make old identifiers disappear, and don't have a search interface that makes me want to pull my hair out.

Friday, March 26, 2010

TreeBASE II has been released


The TreeBASE team have announced that TreeBASE II has been released. I've put part of the announcement on the SSB web site. Given that TreeBASE and I have history, I think it best to keep quiet and see what others think before blogging about it in detail. Plus, there are a lot of new features to explore. Take it for a spin and see what you think.

Tuesday, March 16, 2010

Progress on a phylogeny wiki

I've made some progress on a wiki of phylogenies. Still much to do, but here are some samples of what I'm trying to do.

First up, here's an example of a publication http://iphylo.org/treebase/Doi:10.1016/j.ympev.2008.02.021:
wiki1.png

In addition to basic bibliographic details we have links to GenBank sequences and a phylogeny. The sequences are georeferenced, which enables us to generate a map. At a glance we see that the study area is Central America.

This study published the following tree:
wiki2.png

The tree is displayed using my tvwidget. A key task in constructing the wiki is mapping labels used in TreeBASE to other taxonomic names, for example, those in the NCBI taxonomy database. This is something I first started working on in the TbMap project (doi:10.1186/1471-2105-8-158). In the context of this wiki I'm explicitly mapping TreeBASE taxa to NCBI taxa. Taxa are modelled as "namestrings" (simple text strings), OTUs (TreeBASE taxa), and taxonomic concepts (sets of observations or taxa). For example, the tree shown above has numerous samples of the frog Pristimantis ridens, each with a unique label (namestring) that includes, in this case, voucher specimen information (e.g., "Pristimantis ridens La Selva AJC0522 CR"). Each of these labels is mapped to the NCBI taxon Pristimantis ridens.
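A first pass at such a mapping can be automated by pulling the leading binomial out of each namestring. This is a deliberately simplistic sketch -- real TreeBASE labels are far messier, which is why TbMap needed curation on top of automation:

```python
import re

# Leading "Genus species" pair at the start of an OTU label,
# allowing either a space or an underscore as the separator.
BINOMIAL = re.compile(r"^([A-Z][a-z]+)[ _]([a-z]+(?:-[a-z]+)?)")

def binomial_from_label(label):
    """Guess the taxonomic name embedded in a TreeBASE OTU namestring."""
    m = BINOMIAL.match(label.strip())
    return f"{m.group(1)} {m.group(2)}" if m else None
```

The guessed name can then be checked against the NCBI taxonomy; labels that yield no binomial (or an unknown one) fall through to manual mapping.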

One thing I'm interested in doing is annotating the tree. Eventually I hope to generate (or make it easy to generate) things such as Google Earth phylogenies (via georeferenced sequences and specimens). For now I'm playing with generating nicer labels for the terminal taxa. As it stands if you download the original tree from TreeBASE you have the original study-specific labels (e.g., "Pristimantis ridens La Selva AJC0522 CR"), whereas it would be nice to also have taxonomic names (for example, if you wanted to combine the tree or data with another study). Below the tree you'll see a NEXUS NOTES block with the "ALTTAXNAMES" command. The program Mesquite can use this command to enable users to toggle between different labels, so that you can have either a tree like this:
wiki3.png

or a tree like this:
wiki4.png

Monday, March 15, 2010

How Wikipedia can help scope a project

I'm revisiting the idea of building a wiki of phylogenies using Semantic Mediawiki. One problem with a project like this is that it can rapidly explode. Phylogenies have taxa, which have characters, nucleotide sequences and other genomics data, and names, and come from geographic locations, and are collected and described by people, who may deposit samples in museums, and also write papers, which are published in journals, and so on. Pretty soon, any decent model of a phylogeny database is connected to pretty much anything of interest in the biological sciences. So we have a problem of scope. At what point do we stop adding things to the database model?

It seems to me that Wikipedia can help. Once we hit a topic that exists in Wikipedia, then we can stop. It's a reasonable bet that either now, or at some point in the future, the Wikipedia page is likely to be as good as, or better than, anything a single project could do. Hence, there's probably not much point storing lots of information about genes, countries, geographic regions, people, journals, or even taxa, as Wikipedia has these. This means we can focus on gluing together the core bits of a phylogenetic study (trees, taxa, data, specimens, publications) and then link these to Wikipedia.
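The "is this in Wikipedia?" test is itself scriptable against the MediaWiki API. A sketch (the fetch function is injectable so the lookup can be exercised without the network; the API parameters are standard MediaWiki query-module ones):

```python
import json
from urllib import parse, request

API = "https://en.wikipedia.org/w/api.php"

def _default_fetch(url):
    with request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

def page_exists(title, fetch=_default_fetch):
    """True if English Wikipedia has an article with this exact title."""
    url = API + "?" + parse.urlencode({
        "action": "query",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    })
    pages = json.loads(fetch(url))["query"]["pages"]
    # Missing pages carry a "missing" flag in the response.
    return bool(pages) and not pages[0].get("missing", False)
```

Run over a candidate list of entities, this gives a quick "stop here, link out" report: anything that exists gets a link rather than a local page.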

In a sense this is a variation on the ideas explored in EOL, the BBC, and Wikipedia, but in developing my wiki of phylogenies project (this is the third iteration of this project) it's struck me how the question "is this in Wikipedia?" is the quickest way to answer the question "should I add x to my wiki?" Hence, Wikipedia becomes an antidote to feature bloat, and helps define the scope of a project more clearly.

Thursday, September 17, 2009

Towards a wiki of phylogenies

At the start of this week I took part in a biodiversity informatics workshop at the Naturhistoriska riksmuseets, organised by Kevin Holston. It was a fun experience, and Kevin was a great host, going out of his way to make sure myself and other contributors were looked after. I gave my usual pitch along the lines of "if you're not online you don't exist", and talked about iSpecies, identifiers, and wikis.

I also ran a short, not terribly successful exercise using iTaxon to demo what semantic wikis can do. As is often the case with something that hasn't been polished yet, the students found the mechanics of doing things less than intuitive. I need to do a lot of work making data input easier (to date I've focussed on automated adding of data, and forms to edit existing data). Adding data is easy if you know how, but the user needs to know more than they really should have to.

The exercise was to take some frog taxa from the Frost et al. amphibian tree (doi:10.1206/0003-0090(2006)297[0001:TATOL]2.0.CO;2) and link them to GenBank sequences and museum specimens. The hope was that by making these links new information would emerge. You could think of it as an editable version of this. With a bit of post-exercise tidying, we got some way there. The wiki page for the Frost et al. paper now shows a list of sequences from that paper (not all, I hasten to add), and a map for those sequences that the students added to the wiki:

frost.png


Although much remains to be done, I can't help thinking that this approach would work well for a database like TreeBASE, where one really needs to add a lot of annotation to make it useful (for example, mapping OTUs to taxon names, linking data to sequences and specimens). So, one of the things I'm going to look at is dumping a copy of TreeBASE (complete with trees) into the wiki and seeing what can be done with it. Oh, and I need to make it much, much easier for people to add data.

Tuesday, August 18, 2009

To wiki or not to wiki?

What follows are some random thoughts as I try and sort out what things I want to focus on in the coming days/weeks. If you don't want to see some wallowing and general procrastination, look away now.

I see four main strands in what I've been up to in the last year or so:
  1. services
  2. mashups
  3. wikis
  4. phyloinformatics
Let's take these in turn.

Services
Not glamorous, but necessary. This is basically bioGUID (see also hdl:10101/npre.2009.3079.1). bioGUID provides OpenURL services for resolving articles (it has nearly 84,000 articles in its cache), looking up journal names, resolving LSIDs, and RSS feeds.

Mashups
iSpecies is my now aging tool for mashing up data from diverse sources, such as Wikipedia, NCBI, GBIF, Yahoo, and Google Scholar. I tweak it every so often (mainly to deal with Google Scholar forever mucking around with their HTML). The big limitation of iSpecies is that it doesn't make its results reusable (i.e., you can't write a script to call iSpecies and return data). However, it's still the place I go to quickly find out about a taxon.

The other mashups I've been playing with focus on taking standardised RSS feeds (provided by bioGUID, see above) and mashing them up, sometimes with a nice front end (e.g., my e-Biosphere 09 challenge entry).

Wiki
I've invested a huge amount of effort in learning how wikis (especially Mediawiki and its semantic extensions) work, documented in earlier posts. I created a wiki of taxonomic names as a sandbox to explore some of these ideas.

I've come to the conclusion that for basic taxonomic and biological information, the only sensible strategy for our community is to use (and contribute to) Wikipedia. I'm struggling to see any justification for continuing with a proliferation of taxonomic databases. After e-Biosphere 09 the game's up, people have started to notice that we've an excess of databases (see Claire Thomas in Science, "Biodiversity Databases Spread, Prompting Unification Call", doi:10.1126/science.324_1632).

Phyloinformatics
In truth I've not been doing much on this, apart from releasing tvwidget (code available from Google Code), and playing with a mapping of TreeBASE studies to bibliographic identifiers (available as a featured download from here). I've played with tvwidget in Mediawiki, and it seems to work quite well.

Where now?
So, where now? Here are some thoughts:
  1. I will continue to hack bioGUID (it's now consuming RSS feeds from journals, as well as Zotero). Everything I do pretty much depends on the services bioGUID provides.

  2. iSpecies really needs a big overhaul to serve data in a form that can be built upon. But this requires decisions on what that format should be, so this isn't likely to happen soon. But I think the future of mashup work is to use RDF and triple stores (providing that some degree of editing is possible). I think a tool linking together different data sources (along the lines of my ill-fated Elsevier Challenge entry) has enormous potential.

  3. I'm exploring Wikipedia and Wikispecies. I'm tempted to do a quantitative analysis of Wikipedia's classification. I think there needs to be some serious analysis of Wikipedia if people are going to use it as a major taxonomic resource.

  4. If I focus on Wikipedia (i.e., using an existing wiki rather than try to create my own), then that leaves me wondering what all the playing with iTaxon was for. Well, actually I think the original goal of this blog (way back in December 2005) is ideally suited to a wiki. Pretty much all the elements are in place to dump a copy of TreeBASE into a wiki and open up the editing of links to literature and taxonomic names. I think this is going to handily beat my previous efforts (TbMap, doi:10.1186/1471-2105-8-158), especially as errors will be easy to fix.

So, food for thought. Now, I just need to focus a little and get down to actually doing the work.

Friday, May 22, 2009

Dryad, DOIs, and why data matters more than journal articles


For the last two days I've been participating in a NESCent meeting on Dryad, a "repository of data underlying scientific publications, with an initial focus on evolutionary biology and related fields". The aim of Dryad is to provide a durable home for the kinds of data that don't get captured by existing databases such as GenBank and TreeBASE (for example, the Excel spreadsheets, Word files, and tarballs of data that, if they are lucky, make it on to a journal's web site as supplementary material, like this example). These data have an alarming tendency to disappear (see "Unavailability of online supplementary scientific information from articles published in major journals" doi:10.1096/fj.05-4784lsf).

Perhaps it was because I was participating virtually (via Adobe Connect, which worked very well), but at times I felt seriously out of step with many of the participants. I got the sense that they regard the scientific article as primary, data as secondary, and weren't entirely convinced that data needed to be treated in the same way as a publication. I was arguing that Dryad should assign DOIs to data sets, join CrossRef, and ensure data sets were cited in the same way as papers. For me this is a no brainer -- by making data equivalent to a publication, journals don't need to do anything special, publishers know how to handle DOIs, and will have fewer qualms than with URLs, which have a nasty tendency to break (see "Going, Going, Gone: Lost Internet References" doi:10.1126/science.1088234).

Furthermore, being part of CrossRef would bring other benefits. Their cited-by linking service enables publishers to display lists of articles that cite a given paper -- imagine being able to do this for data sets. Dryad could display not just the paper associated with publication of the data set, but all subsequent citations. As an author, I'd love to see this. It would enable me to see what others had done with my data, and provide an incentive to submit my data to Dryad (providing incentives to authors to archive data is a big issue, see Mark Costello's recent paper doi:10.1525/bio.2009.59.5.9).

Not everyone saw things this way, and it's often a "reality check" to discover that things one takes for granted are not at all obvious to others (leading to mutual incomprehension). Many editors, understandably, think of the journal article as primary, and data as something else (some even struggle to see why one would want to cite data). There's also (to my mind) a ridiculous level of concern about whether ISI would index the data. In the age of Google, who cares? Partly these concerns may reflect the diversity of the participants. Some subjects, such as phylogenetics, are built on reuse of previous data, and it's this reuse that makes data citation both important and potentially powerful (for more on this see my papers hdl:10101/npre.2009.3173.1 and doi:10.1093/bib/bbn022). In many ways, the data is more important than the publication. If I look at a phylogenetics paper published, say, 5 or more years ago, the methods may be outmoded, the software obsolete (I might not be able to run it on a modern machine), and the results likely to be outdated (additional data and/or taxa changing the tree). So, the paper might be virtually useless, but the data continues to be of value. Furthermore, the great thing about data (especially sequence data) is that it can be used in all sorts of unexpected ways. In disciplines such as phylogenetics, data reuse is very common. In other areas in evolution and ecology, this might not be the case.

It will be clear from this that I buy the idea articulated by Philip Bourne (doi:10.1371/journal.pcbi.0010034) that there's really no difference between a database and a journal article and that the two are converging (I've argued for a long time that the best thing that could happen to phylogenetics would be if Molecular Phylogenetics and Evolution and TreeBASE were to merge and become one entity). Data submission would equal publication. In the age of Google where data is unreasonably effective (doi:10.1109/mis.2009.36, PDF here), privileging articles at the expense of data strikes me as archaic.

So, whither Dryad? I wish it every success, and I'm sure it will be a great start. There are some very clever people behind it, and it takes a lot of work to bring a community on board. However, I think Dryad's use of Handles is a mistake (they are the obvious choice of identifier given Dryad is based on DSpace), as this presents publishers with another identifier to deal with, and has none of the benefits of DOIs. Indeed, I would go further and say that the use of Handles + DSpace marks Dryad as being basically yet another digital library project, which is fine, but it puts it outside the mainstream of science publishing, and I think that is a strategic mistake. An example of how to do things better is Nature Precedings, which assigns DOIs to manuscripts, reports, and presentations. I think the use of DOIs in this context demonstrated that Nature was serious, and valued these sorts of resource. Personally, I'd argue that Dryad should be more ambitious, and see itself as a publisher, not a repository. In fact, it could think of itself as a journal publisher. Ironically, maybe the editors at the NESCent meeting were well advised to be wary, what they could be witnessing is the formation of a new kind of publication, where data is the article, and the article is data.

Friday, March 06, 2009

Phylogenies in a wiki

I'm slowly trying to get phylogenies into the wiki that I'm playing with. Here's an early example, the TreeBASE tree T6002, from the study A Phylogenomic Study of Birds Reveals Their Evolutionary History. The tree is displayed using my tvwidget. Below are listed the OTUs in the tree in a crude table. The idea is that this table will contain a mapping between OTU labels and taxa. For example, one OTU is labelled "Diomedea nigripes". This links to the page for Diomedea nigripes, which provides some basic information on this name, including the statement that ITIS regards the correct name for this bird to be Phoebastria nigripes (Audubon, 1839). The table on the page showing the tree displays this information as well.



This is all terribly incomplete and crude, but it gives a sense of where this is going. The plan is to import in bulk the trees and the mappings (from, say, TBMap), as well as the names themselves, and associated literature (including the TreeBASE studies) and then the trees will be embedded in richer data about the taxa.