iPhylo: It's been a while...

Roderic D. M. Page

Tuesday, April 06, 2021

It's been a while...

It's been a while since I've blogged here. The last few months have been, um, interesting for so many reasons. Meanwhile in my little corner of the world there's been the constant challenge of rethinking how to teach online-only, whilst also messing about with a bunch of things in a rather unfocused way (and spending way too much time populating Wikidata). So here I'll touch on a few rather random topics that have come up in the last few months, and elsewhere on this blog I'll try and focus on some of the things that I'm working on. In many ways this blog post is really to serve as a series of bookmarks for things I'd like to think about a bit more.

Taxonomic precision and the "one tree"

One thing that had been bugging me for a while was my inability to find the source of a quote about taxonomic precision that I remembered from my time as a grad student. I was pretty sure that David Penny and Mike Hendy had said it, but where? Found it at last:

Biologists seem to seek “The One Tree” and appear not to be satisfied by a range of options. However, there is no logical difficulty with having a range of trees. There are 34,459,425 possible trees for 11 taxa (Penny et al. 1982), and to reduce this to the order of 10-50 trees is analogous to an accuracy of measurement of approximately one part in 10⁶.

Many measurements in biology are only accurate to one or two significant figures and pale when compared to physical measurements that may be accurate to 10 significant figures. To be able to estimate an accuracy of one tree in 10⁶ reflects the increasing sophistication of tree reconstruction methods. (Note that, on this argument, to identify an organism to a species is also analogous to a measurement with an accuracy of approximately one in 10⁶.). — "Estimating the reliability of evolutionary trees" p.414 doi:10.1093/oxfordjournals.molbev.a040407

I think this quote helps put taxonomy and phylogeny in the broader context of quantitative biology. Building trees that accurately place taxa is a computationally challenging task that yields some of the most precise measurements in biology.
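The arithmetic behind the quote is easy to check. Here is a minimal sketch, assuming the 34,459,425 figure counts unrooted, fully bifurcating trees, for which the count is the double factorial (2n − 5)!!. The function name is mine, not something from the paper.

```python
import math


def unrooted_binary_trees(n_taxa: int) -> int:
    """Count unrooted, fully resolved trees for n_taxa leaves: (2n - 5)!!"""
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (the range is empty for n <= 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


n_trees = unrooted_binary_trees(11)
print(n_trees)  # 34459425

# How many orders of magnitude do we gain by narrowing to 10 or 50 trees?
for kept in (10, 50):
    print(kept, round(math.log10(n_trees / kept), 1))
```

Narrowing 34 million candidates down to a few tens of trees is a reduction of roughly six orders of magnitude, which is where the "one part in 10⁶" comes from.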

Barcodes for everyone

DNA for 95 specimens in 40 min at negligible cost (not included: beverage consumption during 20 min break): this is the 1st video demonstrating the methods used in “MinION barcodes: biodiversity discovery and identification by everyone, for everyone”. https://t.co/W2LuaJ6SYk pic.twitter.com/eNDk6eU2xa
— Rudolf Meier (@RudolfMeier15) March 22, 2021

This is yet another exciting paper from Rudolf Meier's lab (see earlier blog post Signals from Singapore: NGS barcoding, generous interfaces, the return of faunas, and taxonomic burden). The preprint doi:10.1101/2021.03.09.434692 is on bioRxiv. It feels like we are getting ever-closer to the biodiversity tricorder.

Barcodes for Australia

#ArabaBioscan week 23 https://t.co/n9zvE2lTlF Some of the local #Ichneumonoidea diversity from this week's Malaise sample #BIOSCAN #entomology #DNABarcoding @iBOLConsortium pic.twitter.com/yR7yg0B7Ys
— Donald Hobern (@dhobern) April 5, 2021

Donald Hobern (@dhobern) has been blogging about insects collected in Malaise traps in Aranda, Australian Capital Territory (ACT). The insects are being photographed (see stream on Flickr) and will be barcoded.

No barcodes please we're taxonomists!

A paper with a title like "Minimalist revision and description of 403 new species in 11 subfamilies of Costa Rican braconid parasitoid wasps, including host records for 219 species" (Sharkey et al. doi:10.3897/zookeys.1013.55600) was always likely to cause problems, and sure enough some taxonomists had a meltdown. A lot of the arguments centered around whether DNA sequences counted as words, which seems surreal. DNA sequences are strings of characters, just like natural language. Unlike English, not all languages have word breaks. Consider Chinese for example, where search engines can't break text up into words for indexing, but instead use n-grams. I mention this simply because n-grams are a useful way to index DNA sequences and to compute sequence similarity without performing a costly sequence alignment. I used this technique in my DNA barcode browser. If we move beyond arguments about whether a picture and a DNA sequence are enough to describe a species (if all species ever discovered were described this way we'd arguably be much better off than we are now) I think there is a core issue here, namely that the relative size of the intersection between taxa that have been described classically (i.e., with words) and those described almost entirely by DNA (e.g., barcodes) will likely drop as more and more barcoding is done, and this has implications for how we do biology (see Dark taxa: GenBank in a post-taxonomic world).
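To make the n-gram idea concrete, here is a minimal sketch of alignment-free similarity: break each sequence into overlapping n-grams and compare the two sets with a Jaccard index. This is an illustration only, not the code behind the DNA barcode browser; the function names, the default n-gram length of 5, and the toy sequences are all assumptions.

```python
def ngrams(sequence: str, n: int = 5) -> set[str]:
    """Return the set of overlapping n-grams (k-mers) in a sequence."""
    seq = sequence.upper()
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def jaccard_similarity(a: str, b: str, n: int = 5) -> float:
    """Shared n-grams divided by all n-grams seen in either sequence."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0  # both sequences shorter than n: nothing to compare
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Toy sequences, made up for illustration (not real barcodes).
query = "ACGTTGCATGCCGTA"
near = "ACGTTGCATGCCGTT"   # differs at the last base
far = "TTTTTTTTTTTTTTT"

print(jaccard_similarity(query, near))
print(jaccard_similarity(query, far))
```

The same n-grams can be used as index terms in a search engine, so finding similar sequences becomes a text search rather than a round of pairwise alignments.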

Bioschema

The dream of linked data rumbles on. Schema.org is having a big impact on standardising basic metadata encoded in web sites, so much so that anyone building a web site now needs to be familiar with schema.org if they want their site to do well in search engine rankings. I made extensive use of schema.org to model bibliographic data on Australian animals for my Ozymandias project.

Bioschemas aims to provide a biology-specific extension to schema.org, and is starting to take off. For example, GBIF pages for species now have schema.org embedded as JSON-LD, e.g. the page for Chrysochloris visagiei Broom, 1950 has this JSON-LD embedded in a <script type="application/ld+json"> tag:

{
  "@context": [
    "https://schema.org/",
    {
      "dwc": "http://rs.tdwg.org/dwc/terms/",
      "dwc:vernacularName": {
        "@container": "@language"
      }
    }
  ],
  "@type": "Taxon",
  "additionalType": [
    "dwc:Taxon",
    "http://rs.tdwg.org/ontology/voc/TaxonConcept#TaxonConcept"
  ],
  "identifier": [
    {
      "@type": "PropertyValue",
      "name": "GBIF taxonKey",
      "propertyID": "http://www.wikidata.org/prop/direct/P846",
      "value": 2432181
    },
    {
      "@type": "PropertyValue",
      "name": "dwc:taxonID",
      "propertyID": "http://rs.tdwg.org/dwc/terms/taxonID",
      "value": 2432181
    }
  ],
  "name": "Chrysochloris visagiei Broom, 1950",
  "scientificName": {
    "@type": "TaxonName",
    "name": "Chrysochloris visagiei",
    "author": "Broom, 1950",
    "taxonRank": "SPECIES",
    "isBasedOn": {
      "@type": "ScholarlyArticle",
      "name": "Ann. Transvaal Mus. vol.21 p.238"
    }
  },
  "taxonRank": [
    "http://rs.gbif.org/vocabulary/gbif/rank/species",
    "species"
  ],
  "dwc:vernacularName": [
    {
      "@language": "eng",
      "@value": "Visagie s Golden Mole"
    },
    {
      "@language": "eng",
      "@value": "Visagie's Golden Mole"
    },
    {
      "@language": "eng",
      "@value": "Visagie's Golden Mole"
    },
    {
      "@language": "eng",
      "@value": "Visagie's Golden Mole"
    },
    {
      "@language": "",
      "@value": "Visagie's golden mole"
    },
    {
      "@language": "eng",
      "@value": "Visagie's Golden Mole"
    },
    {
      "@language": "deu",
      "@value": "Visagie-Goldmull"
    }
  ],
  "parentTaxon": {
    "@type": "Taxon",
    "name": "Chrysochloris Lacépède, 1799",
    "scientificName": {
      "@type": "TaxonName",
      "name": "Chrysochloris",
      "author": "Lacépède, 1799",
      "taxonRank": "GENUS",
      "isBasedOn": {
        "@type": "ScholarlyArticle",
        "name": "Tabl. Mamm. p.7"
      }
    },
    "identifier": [
      {
        "@type": "PropertyValue",
        "name": "GBIF taxonKey",
        "propertyID": "http://www.wikidata.org/prop/direct/P846",
        "value": 2432177
      },
      {
        "@type": "PropertyValue",
        "name": "dwc:taxonID",
        "propertyID": "http://rs.tdwg.org/dwc/terms/taxonID",
        "value": 2432177
      }
    ],
    "taxonRank": [
      "http://rs.gbif.org/vocabulary/gbif/rank/genus",
      "genus"
    ]
  }
}
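Getting at embedded JSON-LD like this takes very little code. The sketch below uses only the Python standard library to pull every application/ld+json script out of a page and list the identifiers in a Taxon record. The class and function names are mine, and the HTML is a cut-down stand-in for a real GBIF page rather than a fetched copy.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def taxon_identifiers(doc: dict) -> dict:
    """Map each identifier's name to its value, e.g. 'GBIF taxonKey' -> 2432181."""
    return {item["name"]: item["value"] for item in doc.get("identifier", [])}


# Stand-in page: a trimmed version of the JSON-LD shown above.
html = """<html><head><script type="application/ld+json">
{"@type": "Taxon", "name": "Chrysochloris visagiei Broom, 1950",
 "identifier": [{"@type": "PropertyValue", "name": "GBIF taxonKey", "value": 2432181}]}
</script></head><body></body></html>"""

parser = JsonLdExtractor()
parser.feed(html)
taxon = parser.documents[0]
print(taxon["name"], taxon_identifiers(taxon))
```

This treats the JSON-LD as plain JSON, which is enough for harvesting names and identifiers; expanding the @context into full RDF would need a proper JSON-LD library.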

For more details on the potential of Bioschemas see Franck Michel's TDWG Webinar.

OCR correction

Just a placeholder to remind me to revisit OCR correction and the dream of a workflow to correct text for BHL. I came across hOCR-Proofreader (which has a GitHub repo). Internet Archive now provides hOCR files as one of its default outputs, so we're getting closer to a semi-automated workflow for OCR correction. For example, imagine having all this set up on GitHub so that people can correct text and push those corrections to GitHub. So close...

Roger Hyam keeps being awesome

Roger just keeps doing cool things that I keep learning from. In the last few months he's been working on a nice interface to the World Flora Online (WFO) which, let's face it, is horrifically ugly and does unspeakable things to the data. Roger is developing a nicer interface and is doing some cool things under the hood with identifiers that inspired me to revisit LSIDs (see below).

But the other thing Roger has been doing is using GraphQL to provide a clean API for the designer working with him to use. I have avoided GraphQL because I couldn't see what problem it solved. It's not a graph query language (despite the name), it breaks HTTP caching, and it just seemed to be the SOAP of today. But, if Roger's using it, I figured there must be something good here (and yes, I'm aware that GraphQL has a huge chunk of developer mindshare). As I was playing with yet another knowledge graph project I kept running into the challenge of converting a bunch of SPARQL queries into something that could be easily rendered in a web page, which is when the utility of GraphQL dawned on me. The "graph" in this case is really a structured set of results that correspond to the information you want to render on a web page. This may be the result of quite a complex series of queries (in my case using SPARQL on a triple store) that nobody wants to actually see. The other motivator was seeing DataCite's use of GraphQL to query the "PID Graph". So, I think I get it now, in the sense that I see why it is useful.
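To illustrate the gap that a GraphQL layer fills, here is a rough sketch of the reshaping step: take the flat rows that come back from a SPARQL endpoint (in the standard SPARQL JSON results format) and nest them into the shape a web page wants. It is not GraphQL itself, just the kind of work a resolver would hide; the function name, variable names, and sample data are invented for the example.

```python
def nest_bindings(sparql_json: dict, parent_var: str, child_var: str) -> list:
    """Group flat SPARQL result rows into one record per parent value."""
    grouped = {}
    for row in sparql_json["results"]["bindings"]:
        # Create the parent entry even if this row has no child (OPTIONAL match).
        children = grouped.setdefault(row[parent_var]["value"], [])
        if child_var in row:
            children.append(row[child_var]["value"])
    return [
        {parent_var: parent, child_var + "s": children}
        for parent, children in grouped.items()
    ]


# Invented sample in SPARQL JSON results format: one row per (work, author) pair.
flat = {
    "head": {"vars": ["work", "author"]},
    "results": {"bindings": [
        {"work": {"type": "literal", "value": "Paper A"},
         "author": {"type": "literal", "value": "Smith"}},
        {"work": {"type": "literal", "value": "Paper A"},
         "author": {"type": "literal", "value": "Jones"}},
        {"work": {"type": "literal", "value": "Paper B"}},
    ]},
}

page_data = nest_bindings(flat, "work", "author")
print(page_data)
```

A GraphQL schema lets the person building the page ask for "works with their authors" directly and get the nested form back, without ever seeing the SPARQL or this kind of glue code.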

LSIDs back from the dead

LSIDs are back baby! https://t.co/gWoBoY1wgn Persistent identifiers should, you know, persist #PID pic.twitter.com/RFW723DnVV
— Roderic Page (@rdmpage) March 9, 2021

In part inspired by Roger Hyam's work on WFO I released a Life Science Identifier (LSID) Resolver to make LSIDs resolvable. I'll spare you the gory details, but you can think of LSIDs as DOIs for taxonomic names. They came complete with a decentralised resolution mechanism (based on Internet domain names) and standards for what information they return (RDF as XML), and millions were minted for animal, fungal, and plant names. For various reasons they didn't really take off (they were technically tricky to use and didn't return information in a form people could read, so what were the chances?). Still, they contain a lot of valuable information for those of us interested in having lists of names linked to the primary literature. Over the years I have been collecting them and wanted a way to make them available. I've chosen a super-simple approach based on storing them in compressed form in GitHub and wrapping that repo in a simple web site. Lots of limitations, but I like the idea that LSIDs actually, you know, persist.
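For anyone who hasn't met one, an LSID is a URN of the form urn:lsid:authority:namespace:object, with an optional revision on the end, where the authority is normally a domain name. A minimal parsing sketch follows; the names are mine, the example identifier is made up, and this is not the resolver's actual code.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str   # usually a domain name, which is what makes resolution decentralised
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split an LSID into its parts, raising ValueError if it is malformed."""
    parts = text.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated parts: {text!r}")
    if [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    if any(p == "" for p in parts[2:5]):
        raise ValueError(f"empty authority, namespace, or object: {text!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Hypothetical identifier, for illustration only.
lsid = parse_lsid("urn:lsid:example.org:names:12345")
print(lsid.authority, lsid.namespace, lsid.object_id)
```

Once the authority, namespace, and object are separated out, a lookup in a store of harvested records keyed on those three parts is all a simple resolver needs.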

DOIs for Biodiversity Heritage Library

Six months ago we started BHL's Persistent Identifier Working Group, so it's time for a huge shout out to the amazing efforts of @mlichtenberg @SusanWLynch @fauxbrarian @missellb @BHLProgramMgr & @rdmpage to mint DOIs for the historic literature on @BioDivLibrary. #RetroPIDs pic.twitter.com/s4jJBUEQg9
— Nicole Kearney (@nicolekearney) April 1, 2021

In between everything else I've been working with BHL to add DOIs to the literature that they have scanned. Some of this literature is old and of limited scientific value (but sure looks pretty - Nicole Kearney is going to take me to task for saying that), but a lot of it is recent, rich, and scientifically valuable. I'm hoping that the coming months will see a lot of this literature emerge from relative obscurity and become a first class citizen of the taxonomic and biodiversity literature.

Summary

I guess something meaningful and deep should go here... nope, I'm done.