iPhylo: iSpecies

Roderic D. M. Page

Showing posts with label iSpecies.

Wednesday, May 09, 2018

iSpecies meets Lifemap

It's been a little quiet on this blog as I've been teaching, and spending a lot of time data wrangling and trying to get my head around "data lakes" and "triple stores". So there are a few things to catch up on, and a few side projects to report on.

I continue to play with iSpecies, which is a simple mashup of biodiversity data sources. When I last blogged about iSpecies I'd added TreeBASE as a source (iSpecies meets TreeBASE). iSpecies also queries Open Tree of Life, and I've always wanted a better way of displaying the phylogenetic context of a species or genus. TreeBASE is great for a detailed, data-driven view, but doesn't put the taxon in a larger context, nor does the simple visualisation I developed for Open Tree of Life.

A nice large-scale tree visualisation is Lifemap (see De Vienne, D. M. (2016). Lifemap: Exploring the Entire Tree of Life. PLOS Biology, 14(12), e2001624. doi:10.1371/journal.pbio.2001624), and it dawned on me that since Lifemap uses the same toolkit (leaflet.js) that I use to display a map of GBIF records, I could easily add it to iSpecies. After looking at the Lifemap HTML I figured out the API call I need to pan the map to a given taxon using Open Tree of Life taxon identifiers, and voilà, I now have a global tree of life that shows where the query taxon fits in that tree.

Here's a screenshot of a search for Podocarpus showing the first 300 records from GBIF, and the position of Podocarpus in the tree of life. The tree is interactive so you can zoom and pan just like the GBIF map.

Here's another one for the genus Timonius:

Very much still at the "quick and dirty" stage, but I continue to marvel at how much information can be assembled "on the fly" from a few sources, and how much richer this seems than what biodiversity informatics projects offer. There's a huge amount of information that is simply being missed or under-utilised in this area.

Monday, August 28, 2017

Let’s rise up to unite taxonomy and technology

Let’s rise up to unite taxonomy and technology - I thought it already was? @rdmpage https://t.co/AkNz1GYGbw
— Sjúrður Hammer 🐣 (@sjurdur) August 26, 2017

Holly Bik (@hollybik) has an opinion piece in PLoS Biology entitled "Let’s rise up to unite taxonomy and technology" https://doi.org/10.1371/journal.pbio.2002231 (thanks to @sjurdur for bringing this to my attention).

It's a passionate plea for integrating taxonomic knowledge and "omics" data. In her article Bik includes a mockup of the kind of tool she'd like to see (based in part on Phinch), and writes:

Step 2: Clicking on a specific data point (e.g., an OTU) will pull up any online information associated with that species ID or taxonomic group, such as Wikipedia entries, photos, DNA sequences, peer-reviewed articles, and geolocated species observations displayed on a map.

This sort of plea has been made many times, and reminds me very much of PLoS's own efforts when they wanted to build a "Biodiversity Hub" and biodiversity informatics basically failed them. The hub itself later closed down. There's clearly a need for a simple way to summarise what we know about a species, but we've yet to really tackle this (on the face of it) fairly simple task.

Quickly summarising the available information about a species was the motivation behind my little tool iSpecies, which I recently reworked to use DBpedia, GBIF, CrossRef, EOL, TreeBASE and OpenTreeofLife as sources. For the nematode featured in Bik's figure (Desmoscolex) there's not a great deal of easily available information (see http://ispecies.org/?q=Desmoscolex). We can get a little more from other sources not queried by iSpecies, such as BioNames, which aggregates the primary taxonomic literature, see http://bionames.org/search/Desmoscolex.

Part of the problem is that taxonomy is fundamentally a "long tail" field, both in terms of the subject matter (a few very well known species, then millions of poorly known species) and our knowledge of those species (a large, scattered taxonomic literature, much of it not yet digitised, although progress is being made). Furthermore, the names of species (and our conception of them) can change, adding an additional challenge.

But I think we can do a lot better. Simple web-based tools like iSpecies can assemble reasonable information from multiple sources (and in multiple languages) on the fly. It would be nice to expand those sources (the more primary sources the better). The current iSpecies tool searches on species name. This works well if the sources being queried mention that name (e.g., in the title of a paper that has a DOI and is indexed by CrossRef). Given that many of the "omics" datasets Bik works with are likely to have dark taxa, what we'll also need is the ability to search, say, using NCBI taxon ids, and retrieve literature linked to sequences for those taxa.

It would also be useful to package those up in a simple API that other tools could consume. For example, if I wanted to improve the utility of iSpecies, one approach would be to package up the results in a JSON object. Perhaps even use JSON-LD (with global identifiers for taxa, documents, etc.) to make it possible for consumers to easily integrate that data with their own.
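As a rough illustration of what such a package might look like, here is a minimal Python sketch that wraps a taxon and some articles in a JSON-LD document. The choice of schema.org terms (`Taxon`, `ScholarlyArticle`, `subjectOf`), the function name `to_jsonld`, the `example.org` taxon URI and the placeholder title are all assumptions made for the example; none of this is the format iSpecies actually emits.

```python
import json


def to_jsonld(name, taxon_uri, articles):
    """Package a taxon and related articles as a JSON-LD document.

    Hypothetical layout: schema.org is used as the vocabulary, and each
    article is identified by its DOI expressed as an https://doi.org/ URI.
    """
    return {
        "@context": "https://schema.org/",
        "@id": taxon_uri,
        "@type": "Taxon",
        "name": name,
        "subjectOf": [
            {
                "@id": "https://doi.org/" + article["doi"],
                "@type": "ScholarlyArticle",
                "name": article["title"],
            }
            for article in articles
        ],
    }


# Illustrative input only: the taxon URI and the title are placeholders.
doc = to_jsonld(
    "Fitzalania",
    "https://example.org/taxon/Fitzalania",
    [{"doi": "10.1600/036364414x680825", "title": "(title placeholder)"}],
)
print(json.dumps(doc, indent=2))
```

Because every node carries a global identifier in `@id`, a consumer could merge this document with its own data by matching on those URIs rather than on names.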

Taxonomy could be on the brink of another golden age—if we play our cards right. As it is reinvented and reborn in the 21st century, taxonomy needs to retain its traditional organismal-focused approaches while simultaneously building bridges with phylogenetics, ecology, genomics, and the computational sciences.

Taxonomy is, of course, doing just this, albeit not nearly fast enough. There are some pretty serious obstacles, some of them cultural, but some of them due to the nature of the problem. Taxonomic knowledge is massively decentralised, mostly non-digital, and many of the key sources and aggregations are behind paywalls. There is also a fairly large "technical debt" to deal with. Ian Mulvany was recently interviewed by PLoS and he emphasised that because academic publishers had been online from early on they were pioneers, but at the same time this left them with a legacy of older technologies and approaches that can sometimes get in the way of new ideas. I think taxonomy suffers from some of the same problems. Because taxonomy has long been involved with computers, sometimes we ended up betting on the "wrong" solutions. For example, at one time XML was the new hotness, and people invested a lot of effort in developing XML schema, and then ontologies and RDF vocabularies. Meantime much of the web has moved to simple data formats such as JSON, many specialist vocabularies are gathering dust as schema.org takes off, and projects like Wikidata force us to rethink the need for topic-specific databases.

But these are technical details. For me the key point of "Let’s rise up to unite taxonomy and technology" is that it's a symptom of the continued failure of biodiversity informatics to actually address the needs of its users. People keep asking for fairly simple things, and we keep ignoring them (or explaining why it's MUCH harder than people think, which is another way of ignoring them).

Thursday, March 03, 2016

iSpecies meets TreeBASE

I'm continuing to play with the new version of iSpecies, seeing just how far one can get by simply grabbing JSON from various sources and mashing them up. Since the Open Tree of Life is pretty unresolved ("OMG it's full of stars") I've started to grab trees from TreeBASE and add those. Sadly TreeBASE is showing its age and doesn't have a JSON API, so I had to break my rule of only using HTML and Javascript in iSpecies and write some PHP wrappers to talk to TreeBASE. Now, when you search for a genus or species you may see a list of studies from TreeBASE, and a popup menu where you can select a tree to view.

Below is an example (searching for the plant genus Fitzalania).

This example shows one reason phylogenies are useful. Although GBIF (which supplies the data for the map) recognises Fitzalania, a recent study in TreeBASE shows that this renders Meiogyne paraphyletic, and so moves Fitzalania to Meiogyne. Hence GBIF's taxonomy is somewhat behind the current state of knowledge about these plants.

The paper merging these two genera (doi:10.1600/036364414x680825) also shows up in the CrossRef results. Unfortunately TreeBASE doesn't have the DOI for the paper, so linking these two results (the TreeBASE study and the corresponding paper) will require some work. This is another reason why I'm playing with iSpecies: I want to see how many identifiers we can uncover to connect results from different sources, and how many cross links we need to add before it all comes together in a nice linked graph of data.

Monday, January 25, 2016

iSpecies is back: mashing up species data

A decade ago (OMG, that can't be right, an actual decade ago) I created "iSpecies", a simple little tool to mashup a variety of data from GBIF, NCBI, Yahoo, Wikipedia, and Google Scholar to create a search engine for species. It was written in PHP, relied on some degree of *cough* web scraping to get its data, and was a bit of a toy (although that didn't stop me complaining that it could do more than EOL at the time). Eventually I got sick of dealing with Google Scholar constantly changing its HTML and blocking IP addresses to stop people harvesting data (I once managed to get my entire campus blocked), or services disappearing such as Yahoo's image search, and I eventually pulled the plug on it.

A short course I run on "phyloinformatics" starts this week and one of the examples I show is a crude Javascript-based mashup. It struck me that I could tweak that and recreate a simple version of iSpecies, and that's exactly what I've done: http://ispecies.org.

It's nothing fancy, just takes a species name and searches GBIF, EOL, CrossRef, and Open Tree of Life, grabs some data and puts it together on a web page. There are lots of limitations (e.g., only fetches the first 300 localities in GBIF, requires scientific names, tree viewer is pretty awful) but it was pretty simple to put together. It's entirely client-side: the code is all in the HTML file (plus a few Javascript libraries), and is on GitHub at https://github.com/rdmpage/ispecies.
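To give a flavour of the kind of request involved, here is a small Python sketch that builds query URLs for two of these sources. iSpecies itself is written in Javascript and the post does not say which endpoints it calls, so the GBIF occurrence search and CrossRef works endpoints used here, along with the function name `build_queries`, are assumptions chosen for illustration. The sketch only constructs the URLs and does not fetch anything.

```python
from urllib.parse import urlencode

# Public REST endpoints (assumed here; not confirmed as the ones iSpecies uses).
GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"
CROSSREF_WORKS = "https://api.crossref.org/works"


def build_queries(name, max_localities=300, max_papers=20):
    """Return one query URL per source for a scientific name."""
    gbif_params = {
        "scientificName": name,
        "hasCoordinate": "true",   # only records that can be put on a map
        "limit": max_localities,   # a single page of results
    }
    crossref_params = {"query": name, "rows": max_papers}
    return {
        "gbif": GBIF_OCCURRENCE_SEARCH + "?" + urlencode(gbif_params),
        "crossref": CROSSREF_WORKS + "?" + urlencode(crossref_params),
    }


urls = build_queries("Helice crassa")
for source, url in urls.items():
    print(source, url)
```

Each URL returns JSON, so a mashup can fetch them independently and render each result as soon as it arrives.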

Fun as this was, there's a bigger problem with iSpecies and that's that it is a "mashup". I'm simply grabbing data from different sources and redisplaying it. What I really want is what has been described as a "mashup" (awful term, don't use it), that is, I want to combine the data so that it is more than the sum of its parts. For example, some of the data could be cross linked (especially if we add a few more sources and drill down a bit). Some of the papers discovered by CrossRef may include original descriptions, or may be the source of some of the points plotted on the GBIF map. Some may include the phylogenies used to build the Open Tree of Life tree. In order to build a data mashup instead of a web mashup we need to operate at the level of data rather than just human-readable web pages. That is the next thing I'd like to work on, and in many ways it shouldn't be a big leap. The new iSpecies was fairly easy to create because we now have a bunch of web services that all speak JSON. It's a small step from JSON to JSON-LD (especially if the JSON-LD is constructed with reuse in mind). So while it's nice to see iSpecies back, there's a much more interesting next step to think about.

Monday, May 10, 2010

Referring to a one-degree square in RDF using c-squares

I'm in the midst of rebuilding iSpecies (my mash-up of Wikipedia, NCBI, GBIF, Yahoo, and Google search results) with the aim of outputting the results in RDF. The goal is to convert iSpecies from a pretty crude "on-the-fly" mash-up to a triple store where results are cached and can be queried in interesting ways. Why? Partly because I think such a triple store is an obvious way to underpin a "biodiversity hub" of the kind envisaged by PLoS (see my earlier post).

As ever, once one embarks down the RDF route (and I've been here before), one hits all the classic stumbling blocks, such as "what URI do I use for a thing?", and "what vocabulary do I use to express relationships between things?". For example, I'd like to represent the geographic distribution of a taxon as depicted on a GBIF map. How do I describe this in an RDF document?

To make this concrete, take one of my favourite animals, the New Zealand mud crab Helice crassa. Here's the GBIF map for this taxon:

This map has the URL (I kid you not):


http://ogc.gbif.org/wms?request=GetMap
&bgcolor=0x666698
&styles=,,,
&layers=gbif:country_fill,gbif:tabDensityLayer,gbif:country_borders,gbif:country_names
&srs=EPSG:4326
&filter=()(
%3CFilter%3E
%3CPropertyIsEqualTo%3E
%3CPropertyName%3Eurl
%3C/PropertyName%3E
%3CLiteral%3E
%3C![CDATA[http%3A%2F%2Fdata.gbif.org%2Fmaplayer%2Ftaxon%2F17462693%2FtenDeg%2F-45%2F160%2F]]%3E
%3C/Literal%3E
%3C/PropertyIsEqualTo%3E
%3C/Filter%3E)()()
&width=721
&height=362
&Format=image/png
&bbox=160,-45,180,-35

(or http://bit.ly/cuTFW9, if you prefer). Now, there's no way I'm using this URL! Plus, the URL identifies an image, not the distribution.

But, if we look at the map we see that it is made of 1° × 1° squares. If each of those had a URI then I could simply list those squares as the distribution of the crab. This seems straightforward as GBIF has a service that provides these squares. For example, the URL http://data.gbif.org/species/17462693 (where 17462693 corresponds to Helice crassa) returns:


MINX	MINY	MAXX	MAXY	DENSITY
167.0	-45.0	168.0	-44.0	5
174.0	-42.0	175.0	-41.0	20
174.0	-38.0	175.0	-37.0	17
174.0	-37.0	175.0	-36.0	4

These are the 1° × 1° squares for which there are records of Helice crassa. Now, what I'd like to do is have a URI for each square, and I'd like to do this without reinventing the wheel. I've come across a URI space for points of the globe (the "WGS 84 Geographic Point URI Space"), but not one for polygons. Then it dawned on me that perhaps c-squares, developed by Tony Rees at the CSIRO in Australia, would do the trick¹. To quote Tony:

C-squares is a system for storage, querying, display, and exchange of "spatial data" locations and extents in a simple, text-based, human- and machine- readable format. It uses numbered (coded) squares on the earth's surface measured in degrees (or fractions of degrees) of latitude and longitude as fundamental units of spatial information, which can then be quoted as single squares (similar to a "global postcode") in which one or more data points are located, or be built up into strings of codes to represent a wide variety of shapes and sizes of spatial data "footprints".

C-squares appeal partly (and this says nothing good about me) because they have a slightly Byzantine syntax. However, they are short, and quite easy to calculate. I'll let the reader find out the gory details. To give an example, my home town, Auckland, has latitude -36.84, longitude 174.74, which corresponds to the 1° × 1° c-square with the code 3317:364.
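For readers who would rather not dig out the details, here is a minimal Python sketch of the calculation for the 1° × 1° resolution only. The function name `csquare_1deg` is made up for this example, and the sketch does not cover the finer resolutions in the c-squares specification. The code is built from a global quadrant digit (1 = NE, 3 = SE, 5 = SW, 7 = NW), the tens of latitude, the tens of longitude as two digits, then a colon followed by an intermediate quadrant digit and the units of latitude and longitude.

```python
def csquare_1deg(lat, lon):
    """Return the 1-degree c-square code for a latitude/longitude in decimal degrees."""
    # Global quadrant: 1 = NE, 3 = SE, 5 = SW, 7 = NW.
    if lat >= 0:
        quadrant = 1 if lon >= 0 else 7
    else:
        quadrant = 3 if lon >= 0 else 5

    # Whole degrees, clamped so that the poles and the 180th meridian
    # fall in the outermost cell rather than producing an invalid code.
    lat_deg = min(int(abs(lat)), 89)
    lon_deg = min(int(abs(lon)), 179)

    lat_tens, lat_units = divmod(lat_deg, 10)
    lon_tens, lon_units = divmod(lon_deg, 10)

    # Intermediate quadrant within the 10-degree square:
    # 1 = both units digits 0-4, 2 = longitude 5-9 only,
    # 3 = latitude 5-9 only, 4 = both 5-9.
    intermediate = 1 + 2 * (lat_units >= 5) + (lon_units >= 5)

    return f"{quadrant}{lat_tens}{lon_tens:02d}:{intermediate}{lat_units}{lon_units}"


print(csquare_1deg(-36.84, 174.74))  # Auckland -> 3317:364
```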

Now, all I need to do is convert c-squares into URIs. If you append the c-square to http://bioguid.info/csquare:, like this, http://bioguid.info/csquare:3317:364, you get a linked data-friendly URI for the c-square. In a web browser you get a simple web page like this:

A linked data client will get RDF, like this:


<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF 
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
   xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" 
   xmlns:dcterms="http://purl.org/dc/terms/" 
   xmlns:dwc="http://rs.tdwg.org/dwc/terms/" 
   xmlns:geom="http://fabl.net/vocabularies/geometry/1.1/">
   <dcterms:Location rdf:about="http://bioguid.info/csquare:3307:364">
      <rdfs:label>3307:364</rdfs:label>
      <geom:xmin>74</geom:xmin>
      <geom:ymin>-37</geom:ymin>
      <geom:xmax>75</geom:xmax>
      <geom:ymax>-36</geom:ymax>
      <dwc:footprintWKT>POLYGON((-37 75,-37 74,-36 74,-36 75,-37 75))</dwc:footprintWKT>
   </dcterms:Location>
</rdf:RDF>

Now, I can refer to each square by its own URI. This will also enable me to query a triple store by c-square (e.g., what other taxa occur within this 1° × 1° square?).
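Putting the pieces together, the distribution of Helice crassa can be written as a list of c-square URIs. The Python sketch below is one way this might be done, under stated assumptions: it takes the tab-delimited cell list shown earlier as input, uses the centre of each cell to avoid ambiguity at cell edges, and only handles 1° cells. The helper names `cell_to_csquare` and `distribution_uris` are made up for the example.

```python
import csv
import io

CSQUARE_BASE = "http://bioguid.info/csquare:"  # URI prefix used in this post

# The tab-delimited 1-degree cells for Helice crassa listed above.
GBIF_CELLS = (
    "MINX\tMINY\tMAXX\tMAXY\tDENSITY\n"
    "167.0\t-45.0\t168.0\t-44.0\t5\n"
    "174.0\t-42.0\t175.0\t-41.0\t20\n"
    "174.0\t-38.0\t175.0\t-37.0\t17\n"
    "174.0\t-37.0\t175.0\t-36.0\t4\n"
)


def cell_to_csquare(minx, miny):
    """Return the c-square code for the 1-degree cell with south-west corner (minx, miny)."""
    lon, lat = minx + 0.5, miny + 0.5  # centre of the cell
    quadrant = {
        (True, True): 1,    # north, east
        (False, True): 3,   # south, east
        (False, False): 5,  # south, west
        (True, False): 7,   # north, west
    }[(lat >= 0, lon >= 0)]
    lat_tens, lat_units = divmod(int(abs(lat)), 10)
    lon_tens, lon_units = divmod(int(abs(lon)), 10)
    intermediate = 1 + 2 * (lat_units >= 5) + (lon_units >= 5)
    return f"{quadrant}{lat_tens}{lon_tens:02d}:{intermediate}{lat_units}{lon_units}"


def distribution_uris(tsv_text):
    """Convert a tab-delimited list of 1-degree cells into c-square URIs."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        CSQUARE_BASE + cell_to_csquare(float(row["MINX"]), float(row["MINY"]))
        for row in reader
    ]


uris = distribution_uris(GBIF_CELLS)
for uri in uris:
    print(uri)
```

The last cell in the list is the square containing Auckland, so it produces the same URI as the example above.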

¹ Tony Rees had emailed me about this in response to a tweet about URIs for co-ordinates, but it took me a while to realise how useful c-square notation could be.

Tuesday, August 18, 2009

To wiki or not to wiki?

What follows are some random thoughts as I try and sort out what things I want to focus on in the coming days/weeks. If you don't want to see some wallowing and general procrastination, look away now.

I see four main strands in what I've been up to in the last year or so:

services
mashups
wikis
phyloinformatics

Let's take these in turn.

Services
Not glamorous, but necessary. This is basically bioGUID (see also hdl:10101/npre.2009.3079.1). bioGUID provides OpenURL services for resolving articles (it has nearly 84,000 articles in its cache), looking up journal names, resolving LSIDs, and RSS feeds.

Mashups
iSpecies is my now aging tool for mashing up data from diverse sources, such as Wikipedia, NCBI, GBIF, Yahoo, and Google Scholar. I tweak it every so often (mainly to deal with Google Scholar forever mucking around with their HTML). The big limitation of iSpecies is that it doesn't make its results reusable (i.e., you can't write a script to call iSpecies and return data). However, it's still the place I go to to quickly find out about a taxon.

The other mashups I've been playing with focus on taking standardised RSS feeds (provided by bioGUID, see above) and mashing them up, sometimes with a nice front end (e.g., my e-Biosphere 09 challenge entry).

Wiki
I've invested a huge amount of effort in learning how wikis (especially Mediawiki and its semantic extensions) work, documented in earlier posts. I created a wiki of taxonomic names as a sandbox to explore some of these ideas.

I've come to the conclusion that for basic taxonomic and biological information, the only sensible strategy for our community is to use (and contribute to) Wikipedia. I'm struggling to see any justification for continuing with a proliferation of taxonomic databases. After e-Biosphere 09 the game's up, people have started to notice that we've an excess of databases (see Claire Thomas in Science, "Biodiversity Databases Spread, Prompting Unification Call", doi:10.1126/science.324_1632).

Phyloinformatics
In truth I've not been doing much on this, apart from releasing tvwidget (code available from Google Code), and playing with a mapping of TreeBASE studies to bibliographic identifiers (available as a featured download from here). I've played with tvwidget in Mediawiki, and it seems to work quite well.

Where now?
So, where now? Here are some thoughts:

I will continue to hack bioGUID (it's now consuming RSS feeds from journals, as well as Zotero). Everything I do pretty much depends on the services bioGUID provides.

iSpecies really needs a big overhaul to serve data in a form that can be built upon. But this requires decisions on what that format should be, so this isn't likely to happen soon. But I think the future of mashup work is to use RDF and triple stores (providing that some degree of editing is possible). I think a tool linking together different data sources (along the lines of my ill-fated Elsevier Challenge entry) has enormous potential.

I'm exploring Wikipedia and Wikispecies. I'm tempted to do a quantitative analysis of Wikipedia's classification. I think there needs to be some serious analysis of Wikipedia if people are going to use it as a major taxonomic resource.

If I focus on Wikipedia (i.e., using an existing wiki rather than try to create my own), then that leaves me wondering what all the playing with iTaxon was for. Well, actually I think the original goal of this blog (way back in December 2005) is ideally suited to a wiki. Pretty much all the elements are in place to dump a copy of TreeBASE into a wiki and open up the editing of links to literature and taxonomic names. I think this is going to handily beat my previous efforts (TbMap, doi:10.1186/1471-2105-8-158), especially as errors will be easy to fix.

So, food for thought. Now, I just need to focus a little and get down to actually doing the work.

Saturday, December 13, 2008

EOL hyperbole

The latest post on the EOL blog (Biodiversity in a rapidly changing world) really, really annoys me. It claims that

The case of the red lionfish exemplfies how EOL can provide information for science-based decision making. Red lionfish are native to coral reef ecosystems in the Indo-Pacific. Yet, probably due to human release of the fish from aquariums, a large population has found itself in the waters near the Bahamas.

Nope, I suggest it demonstrates just how limited EOL is. If I view the page for the red lionfish I get an out of date map from GBIF that shows a very limited distribution, and doesn't show the introductions in Florida and the Bahamas (I have to wade through text to find reference to the Florida introduction, and the page doesn't mention the Bahamas!). The blog entry states that

In this senerio[sic], EOL and its data partners provide up to date information about the lionfish, or pterois[sic] volitans, in a species page.

Well, the GBIF map is old (a more recent map is available from GBIF itself), the bibliography omits key references such as "Biological invasion of the Indo-Pacific lionfish Pterois volitans along the Atlantic coast of North America" (useful reading for a "science-based decision", one would think). Most of this information I got from Wikipedia, GBIF, and Google Scholar via an iSpecies search.

In other words, EOL in its present state is serving limited, out of date information. The gap between hype and delivery shows no sign of narrowing. How can this help "science-based decision making"? Surely there will come a point when people will tire of breathless statements about how EOL will be useful, and they will start to ask "where's the beef?"

Wednesday, July 30, 2008

iSpecies gets automated tagging

Given that the clones are hot on my heels, I feel the need to add more bells and whistles to iSpecies. The first new feature is automated tagging, and uses Yahoo's Term Extraction API. I send the titles of any papers found, and the Wikipedia snippet, and Yahoo returns keywords ("tags").

As an example, here are the tags for one of my favourite animals, Helice crassa.

mud crabs mangrove estuary muddy sediments mud crab sea coasts mud flats sex ratios habitat preferences activity patterns laboratory conditions estuarine gills burrows endemic respiration original article morphology ventilation dana biology

I think these give a nice sense of what we know about this crab.

I'm storing the tags for future analysis. I think there are some interesting ideas to explore, such as clustering the tags into meaningful groups. I'm also interested in how much we can learn about an organism based on these keywords. Can we automatically infer something about the ecology of the organism?

There is also scope for adding some semantics. Some of these tags are taxon names, and some refer to geographic places. Some are concepts, which could be linked to the relevant page in Wikipedia (Faviki is an example of this approach). At present the tags aren't clickable (i.e., you can't query by tag), but that would be a useful feature. One could get taxa that were tagged with a given term, such as "estuarine". For now, it's a quick way to get a sense of what we know about a taxon.
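Querying by tag amounts to building an inverted index from tags to taxa. The Python sketch below shows the idea under stated assumptions: the function name `build_tag_index` is hypothetical, the tags for Helice crassa are taken from the list above, and the second taxon and its tags are invented purely to make the example non-trivial.

```python
from collections import defaultdict


def build_tag_index(tags_by_taxon):
    """Invert a taxon -> tags mapping into a tag -> set-of-taxa index."""
    index = defaultdict(set)
    for taxon, tags in tags_by_taxon.items():
        for tag in tags:
            index[tag.lower()].add(taxon)
    return index


tags_by_taxon = {
    # A few of the tags Yahoo returned for Helice crassa (see above).
    "Helice crassa": ["mud crabs", "estuarine", "burrows", "mangrove"],
    # Hypothetical second taxon with made-up tags, for illustration only.
    "Taxon B (hypothetical)": ["estuarine", "marine"],
}

index = build_tag_index(tags_by_taxon)
print(sorted(index["estuarine"]))
```

With an index like this, making a tag clickable is just a lookup of the taxa stored under that tag.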

Friday, July 04, 2008

iSpecies clones, and taxonomic intelligence

Mauro Cavalcanti has released e-Species, "a taxonomically intelligent biodiversity search engine" written in Python that mimics much of the functionality of iSpecies. The project is open source, with a SourceForge page, although no files seem to be available yet. This is the second iSpecies clone I've seen, David Shorthouse having written a clone that uses only JSON.

One thing which distinguishes e-Species is the use of Catalogue of Life web services to provide some information on the name. However, it doesn't look like e-Species makes use of synonyms in its searches (i.e., what many refer to as "taxonomic intelligence"). Searching on two alternative names for the sperm whale (Physeter catodon and P. macrocephalus) yields different results (unless the underlying source knows that these names are synonyms, such as NCBI). Presumably, a taxonomically intelligent search would be able to merge results from searches using different names, and present those together.
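As a very simple illustration of merging, the Python sketch below interleaves the result lists returned for each synonym and drops duplicates. This is a plain round-robin merge, not any of the published rank-merging methods, and it assumes each result carries a stable identifier in an `id` field (for example a DOI). The function name `merge_results` and the sample data are made up for the example.

```python
from itertools import zip_longest


def merge_results(results_by_name):
    """Round-robin merge of ranked result lists, keeping the first copy of each item."""
    merged, seen = [], set()
    # Take the top hit from every list, then the second hit from every list, and so on.
    for tier in zip_longest(*results_by_name.values()):
        for item in tier:
            if item is not None and item["id"] not in seen:
                seen.add(item["id"])
                merged.append(item)
    return merged


# Made-up results for the two sperm whale names; "paper-2" is found under both.
results_by_name = {
    "Physeter catodon": [{"id": "paper-1"}, {"id": "paper-2"}],
    "Physeter macrocephalus": [{"id": "paper-2"}, {"id": "paper-3"}],
}

merged = merge_results(results_by_name)
print([item["id"] for item in merged])
```

Interleaving by rank gives each name an equal say in the merged list; a real implementation would probably want to weight sources and normalise their scores.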

Merging results requires some thought as to how to merge lists from different sources (e.g., merging lists of publications and images). This has been the subject of much study in the context of merging results from different search engines. Some starting points are:

The last link is a student project and is a Microsoft Word document, which I've uploaded to Scribd and embedded below.

Tadpole: A Meta search engine

Wednesday, June 11, 2008

More GBIF errors, courtesy of FishBase

Resurrecting iSpecies after moving it to a new folder on one of my servers, and browsing popular searches, I keep coming across clearly erroneous distributions. FishBase seems a major culprit. For example, the common pandora Pagellus erythrinus is a marine fish, yet GBIF displays numerous occurrences in mainland Africa (dots with black centre on map below).

What gives? Well, after struggling with the somewhat non-intuitive GBIF web site I found that the erroneous records are from FishBase. As with the frog example I blogged about earlier, the actual records have locality information indicating most of the records come from the Mediterranean, but the latitudes and longitudes are reversed. Swapping these, the records show a more believable distribution (white dots on SVG map below). If you don't see the map, use a decent web browser such as Safari 3 or Firefox 2. If you must use Internet Explorer, grab the RENESIS player.
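A check for this particular error is easy to sketch. The Python below swaps a record's coordinates when the point as recorded falls outside the region implied by its locality text but the swapped point falls inside it. The function name `fix_swapped`, the very rough Mediterranean bounding box and the sample coordinates are all assumptions made for the example; this is not how GBIF or FishBase process their data.

```python
def fix_swapped(lat, lon, bbox):
    """Return (lat, lon, swapped) after testing for reversed coordinates.

    bbox is (min_lat, min_lon, max_lat, max_lon) for the region where the
    record is expected to be, e.g. derived from its locality description.
    """
    def inside(a_lat, a_lon):
        return bbox[0] <= a_lat <= bbox[2] and bbox[1] <= a_lon <= bbox[3]

    # A latitude beyond +/-90 is impossible, so the values must be reversed.
    if abs(lat) > 90 and abs(lon) <= 90:
        return lon, lat, True
    # Otherwise swap only if that moves the point into the expected region.
    if not inside(lat, lon) and inside(lon, lat):
        return lon, lat, True
    return lat, lon, False


# Very rough bounding box for the Mediterranean (an assumption for this example).
MEDITERRANEAN = (30.0, -6.0, 46.0, 36.0)

# A point recorded as 15 N, 36 E lies in mainland Africa;
# swapped, it becomes 36 N, 15 E, in the central Mediterranean.
print(fix_swapped(15.0, 36.0, MEDITERRANEAN))
```

Records that are already inside the expected region are returned unchanged, so the check is safe to run over a whole dataset.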

I know I've harped on about this before, but surely the time is ripe for some clever data cleaning? Especially if users start to lose their trust in GBIF.

Thursday, May 10, 2007

ITIS and DOIs

Following on from my earlier grumble about how the Catalogue of Life handles literature, I've spent an afternoon mapping publications in the "itis".publications table in a copy of ITIS to external GUIDs, such as DOIs, Handles, and SICIs in JSTOR. The mapping is not complete by any means, but gives an idea of how many publications have GUIDs. You can view the mapping here. Many of the publications in ITIS are books, which don't have DOIs. A lot of the literature is also old (although this doesn't always mean it won't have a DOI).

Of 4296 records, 324 have DOIs (around 7.5%). Not a lot, but still a reasonable chunk. At least 700 of the ITIS publications are books (based on having an ISBN), so the percentage among the non-book publications is a little higher (roughly 9%).

The point of this exercise (following on from my comments on the design flaw in the Catalogue of Life), is that I think taxonomic databases need to use GUIDs internally to maximise their stability and utility.

Indeed, this is another reason to be disappointed with ZooBank. In addition to a poor way to navigate trees (which prompted me to explore tools such as PygmyBrowse), ZooBank does exactly what ITIS and the Catalogue of Life do when it comes to displaying literature -- it displays a text citation (albeit with an invitation to view that record in Zoological Record, a subscription-based service).

For example, the copepod Nitocrellopsis texana was described in ITIS publication 3072, which I've discovered has the DOI doi:10.1023/A:1003892200897. Given a DOI we have a GUID for the publication, and a direct link to it. In contrast, ZooBank merely gives us:

Nitocrellopsis texana n. sp. from central TX (U.S.A.) and N. ahaggarensis n. sp. from the central Algerian Sahara (Copepoda, Harpacticoida). Hydrobiologia 418 (1-3) 15 January: 82

and a link to Zoological Record. Interestingly, even with the resources of ISI behind it, the Zoological Record result doesn't have the DOI.

This for me is one reason ZooBank was so disappointing: it actually provided little of value.

What next? Well, with the 300 or so references mapped to DOIs, one could link those to the ITIS records for the corresponding taxonomic names, and serve these up through something like iSpecies, for example. These would be links to the literature, in many cases original descriptions, to supplement the other literature found by iSpecies.