Tuesday, April 06, 2021

It's been a while...

It's been a while since I've blogged here. The last few months have been, um, interesting for so many reasons. Meanwhile in my little corner of the world there's been the constant challenge of rethinking how to teach online-only, whilst also messing about with a bunch of things in a rather unfocused way (and spending way too much time populating Wikidata). So here I'll touch on a few rather random topics that have come up in the last few months, and elsewhere on this blog I'll try and focus on some of the things that I'm working on. In many ways this blog post really serves as a series of bookmarks for things I'd like to think about a bit more.

Taxonomic precision and the "one tree"

One thing that had been bugging me for a while was my inability to find the source of a quote about taxonomic precision that I remembered from my days as a grad student. I was pretty sure that David Penny and Mike Hendy had said it, but where? Found it at last:

Biologists seem to seek “The One Tree” and appear not to be satisfied by a range of options. However, there is no logical difficulty with having a range of trees. There are 34,459,425 possible trees for 11 taxa (Penny et al. 1982), and to reduce this to the order of 10-50 trees is analogous to an accuracy of measurement of approximately one part in 10⁶.

Many measurements in biology are only accurate to one or two significant figures and pale when compared to physical measurements that may be accurate to 10 significant figures. To be able to estimate an accuracy of one tree in 10⁶ reflects the increasing sophistication of tree reconstruction methods. (Note that, on this argument, to identify an organism to a species is also analogous to a measurement with an accuracy of approximately one in 10⁶.) — "Estimating the reliability of evolutionary trees" p. 414 doi:10.1093/oxfordjournals.molbev.a040407

I think this quote helps put taxonomy and phylogeny in the broader context of quantitative biology. Building trees that accurately place taxa is a computationally challenging task that yields some of the most precise measurements in biology.
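As an aside, the 34,459,425 figure is just the number of unrooted binary trees for 11 taxa, (2n-5)!!, which is easy to check for yourself. A minimal sketch in Python:

from math import prod

def unrooted_trees(n):
    """Number of unrooted binary trees for n taxa: (2n-5)!! = 3 x 5 x ... x (2n-5)."""
    return prod(range(3, 2 * n - 4, 2))

print(unrooted_trees(11))  # 34459425, the figure quoted by Penny and Hendy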

Barcodes for everyone

This is yet another exciting paper from Rudolf Meier's lab (see earlier blog post Signals from Singapore: NGS barcoding, generous interfaces, the return of faunas, and taxonomic burden). The preprint doi:10.1101/2021.03.09.434692 is on bioRxiv. It feels like we are getting ever-closer to the biodiversity tricorder.

Barcodes for Australia

Donald Hobern (@dhobern) has been blogging about insects collected in Malaise traps in Aranda, Australian Capital Territory (ACT). The insects are being photographed (see stream on Flickr) and will be barcoded.

No barcodes please we're taxonomists!

A paper with a title like "Minimalist revision and description of 403 new species in 11 subfamilies of Costa Rican braconid parasitoid wasps, including host records for 219 species" (Sharkey et al. doi:10.3897/zookeys.1013.55600) was always likely to cause problems, and sure enough some taxonomists had a meltdown. A lot of the arguments centred on whether DNA sequences count as words, which seems surreal. DNA sequences are strings of characters, just like natural language. Unlike English, not all languages have word breaks: consider Chinese, where search engines can't break text into words for indexing and instead use n-grams. I mention this simply because n-grams are also a useful way to index DNA sequences and to compute sequence similarity without performing a costly sequence alignment; I used this technique in my DNA barcode browser (see the sketch below). If we move beyond arguments about whether a picture and a DNA sequence are enough to describe a species (if all the species ever discovered were described this way we'd arguably be much better off than we are now), I think there is a core issue here: the relative size of the intersection between taxa that have been described classically (i.e., with words) and those described almost entirely by DNA (e.g., barcodes) will likely shrink as more and more barcoding is done, and this has implications for how we do biology (see Dark taxa: GenBank in a post-taxonomic world).
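Here is a minimal sketch of the n-gram (k-mer) idea, not the actual code from the DNA barcode browser: split two sequences into overlapping k-mers and compare the sets, no alignment required.

def kmers(seq, k=5):
    """Break a DNA sequence into its set of overlapping k-mers (n-grams)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=5):
    """Crude alignment-free similarity: Jaccard index on shared k-mers."""
    x, y = kmers(a, k), kmers(b, k)
    return len(x & y) / len(x | y)

print(similarity("ACGTACGTGGTACCA", "ACGTACGTGGAACCA"))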

Bioschemas

The dream of linked data rumbles on. Schema.org is having a big impact on standardising the basic metadata encoded in web sites, so much so that anyone building a web site now needs to be familiar with schema.org if they want their site to do well in search engine rankings. I made extensive use of schema.org to model bibliographic data on Australian animals for my Ozymandias project.

Bioschemas aims to provide a biology-specific extension to schema.org, and is starting to take off. For example, GBIF pages for species now have schema.org embedded as JSON-LD, e.g. the page for Chrysochloris visagiei Broom, 1950 has this JSON-LD embedded in a <script type="application/ld+json"> tag:

{
  "@context": [
    "https://schema.org/",
    {
      "dwc": "http://rs.tdwg.org/dwc/terms/",
      "dwc:vernacularName": { "@container": "@language" }
    }
  ],
  "@type": "Taxon",
  "additionalType": [
    "dwc:Taxon",
    "http://rs.tdwg.org/ontology/voc/TaxonConcept#TaxonConcept"
  ],
  "identifier": [
    {
      "@type": "PropertyValue",
      "name": "GBIF taxonKey",
      "propertyID": "http://www.wikidata.org/prop/direct/P846",
      "value": 2432181
    },
    {
      "@type": "PropertyValue",
      "name": "dwc:taxonID",
      "propertyID": "http://rs.tdwg.org/dwc/terms/taxonID",
      "value": 2432181
    }
  ],
  "name": "Chrysochloris visagiei Broom, 1950",
  "scientificName": {
    "@type": "TaxonName",
    "name": "Chrysochloris visagiei",
    "author": "Broom, 1950",
    "taxonRank": "SPECIES",
    "isBasedOn": {
      "@type": "ScholarlyArticle",
      "name": "Ann. Transvaal Mus. vol.21 p.238"
    }
  },
  "taxonRank": [
    "http://rs.gbif.org/vocabulary/gbif/rank/species",
    "species"
  ],
  "dwc:vernacularName": [
    { "@language": "eng", "@value": "Visagie s Golden Mole" },
    { "@language": "eng", "@value": "Visagie's Golden Mole" },
    { "@language": "eng", "@value": "Visagie's Golden Mole" },
    { "@language": "eng", "@value": "Visagie's Golden Mole" },
    { "@language": "", "@value": "Visagie's golden mole" },
    { "@language": "eng", "@value": "Visagie's Golden Mole" },
    { "@language": "deu", "@value": "Visagie-Goldmull" }
  ],
  "parentTaxon": {
    "@type": "Taxon",
    "name": "Chrysochloris Lacépède, 1799",
    "scientificName": {
      "@type": "TaxonName",
      "name": "Chrysochloris",
      "author": "Lacépède, 1799",
      "taxonRank": "GENUS",
      "isBasedOn": {
        "@type": "ScholarlyArticle",
        "name": "Tabl. Mamm. p.7"
      }
    },
    "identifier": [
      {
        "@type": "PropertyValue",
        "name": "GBIF taxonKey",
        "propertyID": "http://www.wikidata.org/prop/direct/P846",
        "value": 2432177
      },
      {
        "@type": "PropertyValue",
        "name": "dwc:taxonID",
        "propertyID": "http://rs.tdwg.org/dwc/terms/taxonID",
        "value": 2432177
      }
    ],
    "taxonRank": [
      "http://rs.gbif.org/vocabulary/gbif/rank/genus",
      "genus"
    ]
  }
}
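If you want to grab this JSON-LD yourself, a sketch (assuming requests and BeautifulSoup are installed; the species key in the URL is the same GBIF taxonKey that appears in the JSON above):

import json
import requests
from bs4 import BeautifulSoup

# Fetch a GBIF species page and pull out the embedded JSON-LD
# (assumes the markup keeps the <script type="application/ld+json"> tag described above).
html = requests.get("https://www.gbif.org/species/2432181").text
tag = BeautifulSoup(html, "html.parser").find("script", type="application/ld+json")
taxon = json.loads(tag.get_text())
print(taxon["name"], "-", taxon["scientificName"]["author"])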

For more details on the potential of Bioschemas see Franck Michel's TDWG Webinar.

OCR correction

Just a placeholder to remind me to revisit OCR correction and the dream of a workflow for correcting BHL text. I came across hOCR-Proofreader (which has a GitHub repo). The Internet Archive now provides hOCR files as one of its default outputs, so we're getting closer to a semi-automated workflow for OCR correction: imagine having all this set up on GitHub, so that people can correct text and push those corrections back as commits. So close...

Roger Hyam keeps being awesome

Roger just keeps doing cool things that I keep learning from. In the last few months he's been working on a nice interface to the World Flora Online (WFO) which, let's face it, is horrifically ugly and does unspeakable things to the data. Roger is developing a nicer interface and is doing some cool things under the hood with identifiers that inspired me to revisit LSIDs (see below).

But the other thing Roger has been doing is using GraphQL to provide a clean API for the designer working with him. I had avoided GraphQL because I couldn't see what problem it solved: it's not a graph query language (despite the name), it breaks HTTP caching, and it seemed to be the SOAP of today. But if Roger's using it, I figured there must be something good here (and yes, I'm aware that GraphQL has a huge chunk of developer mindshare). As I was playing with yet another knowledge graph project I kept running into the challenge of converting a bunch of SPARQL queries into something that could be easily rendered in a web page, which is when the utility of GraphQL dawned on me. The "graph" in this case is really a structured set of results that corresponds to the information you want to render on a web page. It may be the result of quite a complex series of queries (in my case using SPARQL on a triple store) that nobody wants to actually see. The other motivator was seeing DataCite's use of GraphQL to query the "PID Graph". So, I think I get it now, in the sense that I see why it is useful.
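To make that concrete, here is a sketch of the shape of the thing. The endpoint and schema are entirely hypothetical; the point is that the query mirrors exactly what a web page needs to render, regardless of how many SPARQL queries sit behind it.

import requests

# Hypothetical GraphQL endpoint and schema, purely for illustration:
# one query returns the structured result a page needs, hiding the
# SPARQL (or whatever else) that actually assembles it.
QUERY = """
{
  taxon(id: "example-taxon-id") {
    name
    publishedIn { title year doi }
    synonyms { name }
  }
}
"""

response = requests.post("https://example.org/graphql", json={"query": QUERY})
print(response.json()["data"]["taxon"]["name"])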

LSIDs back from the dead

In part inspired by Roger Hyam's work on WFO, I released a Life Science Identifier (LSID) Resolver to make LSIDs resolvable. I'll spare you the gory details, but you can think of LSIDs as DOIs for taxonomic names. They came complete with a decentralised resolution mechanism (based on Internet domain names) and standards for what information they return (RDF as XML), and millions were minted for animal, fungal, and plant names. For various reasons they didn't really take off (they were technically tricky to use and didn't return information in a form people could read, so what were the chances?). Still, they contain a lot of valuable information for those of us interested in having lists of names linked to the primary literature. Over the years I have been collecting them and wanted a way to make them available. I've chosen a super-simple approach based on storing them in compressed form on GitHub and wrapping that repo in a simple web site. Lots of limitations, but I like the idea that LSIDs actually, you know, persist.
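For those who have never met one, an LSID is just a structured URN, and pulling it apart is trivial. A sketch (the identifier in the example is made up):

def parse_lsid(lsid):
    """Split an LSID (urn:lsid:authority:namespace:object[:revision]) into its parts."""
    parts = lsid.split(":")
    if parts[:2] != ["urn", "lsid"] or len(parts) < 5:
        raise ValueError("not an LSID: " + lsid)
    return {
        "authority": parts[2],   # the domain used for decentralised resolution
        "namespace": parts[3],
        "object": parts[4],
        "revision": parts[5] if len(parts) > 5 else None,
    }

# Identifier made up for illustration, in the style of a nomenclator LSID
print(parse_lsid("urn:lsid:example.org:names:123456"))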

DOIs for Biodiversity Heritage Library

In between everything else I've been working with BHL to add DOIs to the literature that they have scanned. Some of this literature is old and of limited scientific value (but sure looks pretty - Nicole Kearney is going to take me to task for saying that), but a lot of it is recent, rich, and scientifically valuable. I'm hoping that the coming months will see a lot of this literature emerge from relative obscurity and become a first class citizen of the taxonomic and biodiversity literature.

Summary

I guess something meaningful and deep should go here... nope, I'm done.

Saturday, October 24, 2020

Visualising article coverage in the Biodiversity Heritage Library

It's funny how some images stick in the mind. A few years ago Chris Freeland (@chrisfreeland), then working for Biodiversity Heritage Library (BHL), created a visualisation of BHL content relevant to the African continent. It's a nice example of small multiples.


For more than a decade (gulp) I've been extracting articles from the BHL and storing them in BioStor. My approach is to locate articles based on metadata (e.g., information on title, volume, and pagination) and store the corresponding set of BHL pages in BioStor. BHL in turn regularly harvests this information and displays these articles as "parts" on their web site. Most of this process is semi-automated, but still requires a lot of manual work. One thing I've struggled with is getting a clear sense of how much progress has been made, and how much remains to be done. This has become more pressing given work I'm doing with Nicole Kearney (@nicolekearney) on Australian content. Nicole has a team of volunteers feeding me lists of article metadata for journals relevant to Australian biodiversity, and it would be nice to see where the remaining gaps are.

So, motivated by this, and also a recent comment by Nicky Nicolson (@nickynicolson) about the lack of such visualisations, I've put together a crude tool to try and capture the current state of coverage. You can see the results here: https://rdmpage.github.io/bhl-article-coverage/.

As an example, here is v.25:pt.2 (1988) of Memoirs of the Queensland Museum. Each contiguous block of colour highlights an article in this volume:


This view is constructed for each scanned item (typically a volume or an issue). I then generate a PNG bitmap thumbnail of this display for each volume, and display these together on a page for the corresponding journal (e.g., Memoirs of the Queensland Museum):

So at a glance we can see the coverage for a journal. Gray represents pages that have not been assigned to an article, so if you want to add articles to BHL those are the volumes to focus on.

There's an argument to be made that it is crazy to spend a decade extracting some 226,000 articles semi-automatically. Ideally we could use machine learning to identify articles in BHL. It would be a huge time saver if we could simply run a scanned BHL volume through a tool that could (a) extract article-level metadata and (b) associate that metadata with the corresponding set of scanned pages. Among the possible approaches would be to develop a classifier that assigns each page in a scanned volume to a category such as "article start", "article end", "article", "plate", etc. In effect, we want a tool that could segment scans into articles (and hence reproduce the coverage diagrams shown above) simply based on attributes of the pages. This doesn't solve the entire problem (we still need to extract metadata such as article titles), but it would be a start. However, it poses the classic dilemma: do I keep doing this manually, or do I stop adding articles and take the time to learn a new technology, in the hope that eventually I will end up adding more articles than if I'd persisted with the manual approach?
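A minimal sketch of the sort of page classifier I have in mind, using the OCR text of each page (rather than image features) and scikit-learn, with toy hand-labelled training data:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data: OCR text of pages, hand-labelled by their role in the scanned volume.
pages = [
    "A new species of Aname (Araneae: Nemesiidae) from Queensland Abstract ...",
    "The holotype differs from all congeners in the shape of the spermathecae ...",
    "Figs 1-6. 1, carapace, dorsal view; 2, eyes; 3, sternum ...",
    "REFERENCES Raven, R.J. 1985. The spider infraorder Mygalomorphae ...",
]
labels = ["article start", "article", "plate", "article end"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(pages, labels)

# Predict the role of an unseen page
print(model.predict(["Fig. 7. Lateral view of paratype ..."]))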


Wednesday, October 21, 2020

GBIF Challenge success

Somewhat stunned by the fact that the DNA barcode browser I described earlier was one of the (minor) prizewinners in this year's GBIF Ebbe Nielsen Challenge. For details on the winner and the other place-getters see ShinyBIOMOD wins 2020 GBIF Ebbe Nielsen Challenge. Obviously I'm biased, but it's nice to see the challenge inspiring creativity in biodiversity informatics. Congratulations to everyone who took part.

My entry is live at https://dna-barcode-browser.herokuapp.com. I had a difficult time keeping it online over the summer due to meow attacks, but managed to sort that out. Turns out the cloud provider I used to host Elasticsearch switched from securing the server by default to making it completely open to anyone, and I'd missed that change.

Given that the project was a success, I'm tempted to revisit it and explore further ways to incorporate phylogenetic trees into a biodiversity browser.


Wednesday, September 23, 2020

Using the hypothes.is API to annotate PDFs

With somewhat depressing regularity I keep cycling back to things I was working on earlier but never quite got working the way I wanted. The last couple of days it's been the turn of hypothes.is.

One of the things I'd like to have is a database of all taxonomic names such that if you clicked on a name you would get not only the bibliographic record for the publication where that name first appeared (which is what I've been building for animals in BioNames), but also the actual publication with the name highlighted in the text. This assumes that the publication has been digitised (say, as a PDF) and is accessible, but let's assume that this is the case. Now, we could do this manually, but we have tools to find taxonomic names in text. And in my use case I often know which page the name is on, and what the name is, so all I really want is to be able to highlight it programmatically (because I have millions of names to deal with).

So, time to revisit the hypothes.is API. One of the neat "tricks" hypothes.is have managed is the ability to annotate, say, a web page for an article and have that annotation automagically appear on the PDF version of the same article. As described in How Hypothesis interacts with document metadata this is in part because hypothes.is extracts metadata from the article's web page, such as DOI and link to the PDF, and stores that with the annotation (I say "in part" because the other part of the trick is to be able to locate annotations in different versions of the same text). If you annotate a PDF, hypothes.is stores the URL of the PDF and also a "fingerprint" of the PDF (see PDF Fingerprinting for details). This means that you can also add an annotation to a PDF offline (for example, on a file you have downloaded onto your computer) and - if hypothes.is has already encountered this PDF - that annotation will appear in the PDF online.

What I want to do is take a PDF, highlight the scientific name, and upload that annotation to hypothes.is so that it is visible online when anyone opens the PDF (and ideally when they look at the web version of the same article). I want to do this programmatically. Long story short, this seems doable. Here is an example annotation that I created and sent to hypothes.is via their API:

{
    "uri": "http://www.scielo.br/pdf/isz/v105n3/1678-4766-isz-105-03-00372.pdf",
    "document": {
        "highwire": {
            "doi": [
                "10.1590/1678-476620151053372375"
            ]
        },
        "dc": {
            "identifier": [
                "doi:10.1590/1678-476620151053372375"
            ]
        },
        "link": [
            {
                "href": "urn:x-pdf:6124e7bdb33241429158b11a1b2c4ba5"
            }
        ]
    },
    "tags": [
        "api"
    ],
    "target": [
        {
            "source": "http://www.scielo.br/pdf/isz/v105n3/1678-4766-isz-105-03-00372.pdf",
            "selector": [
                {
                    "type": "TextQuoteSelector",
                    "exact": "Alpaida venger sp. nov.",
                    "prefix": "imens preserved in 75% ethanol. ",
                    "suffix": " (Figs 1-9) Type-material. Holot"
                },
                {
                    "type": "TextPositionSelector",
                    "start": 4834,
                    "end": 4857
                }
            ]
        }
    ],
    "user": "acct:xxx@hypothes.is",
    "permissions": {
        "read": [
            "group:__world__"
        ],
        "update": [
            "acct:xxx@hypothes.is"
        ],
        "delete": [
            "acct:xxx@hypothes.is"
        ],
        "admin": [
            "acct:xxx@hypothes.is"
        ]
    }
}

In this example the article and the PDF are linked by including the DOI and PDF fingerprint in the same annotation (thinking about this I should probably also have included the PDF URL in document.highwire.pdf_url[]). I extracted the PDF fingerprint using mutool and added that as the urn:x-pdf identifier.

The actual annotation itself is described twice, once using character position (start and end of the text string relative to the cleaned text extracted from the PDF) and once by including short fragments of text before and after the bit I want to highlight (Alpaida venger sp. nov.). In my limited experience so far this combination seems to provide enough information for hypothes.is to also locate the annotation in the HTML version of the article (if one exists).
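Sending the annotation is then a single authenticated HTTP POST. A minimal sketch, assuming the JSON above has been saved as annotation.json and that you have a hypothes.is developer API token:

import json
import requests

API_TOKEN = "YOUR_HYPOTHESIS_API_TOKEN"  # personal developer token from your hypothes.is account

with open("annotation.json") as f:  # the annotation payload shown above
    annotation = json.load(f)

r = requests.post(
    "https://api.hypothes.is/api/annotations",
    headers={"Authorization": "Bearer " + API_TOKEN},
    json=annotation,
)
print(r.status_code, r.json().get("id"))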

You can see the result for yourself using the hypothes.is proxy (https://via.hypothes.is). Here is the annotation on the PDF (https://www.scielo.br/pdf/isz/v105n3/1678-4766-isz-105-03-00372.pdf)


and here is the annotation on HTML (https://doi.org/10.1590/1678-476620151053372375)



If you download the PDF onto your computer and open the file in Chrome you can also see the annotation in the PDF (to do this you will need to install the hypothes.is extension for Chrome and click the hypothes.is symbol in Chrome's toolbar).

In summary, we have a pretty straightforward way to automatically annotate papers offline using just the PDF.

Friday, September 11, 2020

Darwin Core Million reminder, and thoughts on bad data

The following is a guest post by Bob Mesibov.

No winner yet in the second Darwin Core Million for 2020, but there are another two and a half weeks to go (to 30 September). For details of the contest see this iPhylo blog post. And please don’t submit a million RECORDS, just (roughly) a million DATA ITEMS. That’s about 20,000 records with 50 fields in the table, or about 50,000 records with 20 fields, or something arithmetically similar.


The purpose of the Darwin Core Million is to celebrate high-quality occurrence datasets. These are extraordinarily rare in biodiversity informatics.

I’ll unpick that. I’m not talking about the accuracy of the records. For most records, the “what”, “where”, “when” and “by whom” are probably correct. An occurrence record is a simple fact: Wilma Flintstone collected a flowering specimen of an Arizona Mountain Dandelion 5 miles SSE of Walker, California on 27 June 2019. More technically, she collected Agoseris parviflora at 38.4411 –119.4393, as recorded by her handheld GPS.

What could possibly go wrong in compiling a dataset of simple records like that in a spreadsheet or database? Let me count a few of the ways:

  • data items get misspelled or misnumbered
  • data items get put in the wrong field
  • data items are put in a field for which they are invalid or inappropriate
  • data items that should be entered get left out
  • data items get truncated
  • data items contain information better split into separate fields
  • data items contain line breaks
  • data items get corrupted by copying down in a spreadsheet
  • data items disagree with other data items in the same record
  • data items refer to unexplained entities (“habitat type A”)
  • paired data items don’t get paired (e.g. latitude but no longitude)
  • the same data item appears in different formats in different records
  • missing data items are represented by blanks, spaces, “?”, “na”, “-”, “unknown”, “not recorded” etc, all in the same data table
  • character encoding failures create gibberish, question marks and replacement characters (�)
  • weird control characters appear in data items, and parsing fails
  • dates get messed up (looking at you, Excel)
  • records get duplicated after minor edits

In previous blog posts (here and here) I’ve looked at explanations for poor-quality data at the project, institution and agency level — data sources I referred to collectively as the “PIA”. I don’t think any of those explanations are controversial. Here I’m going to be rude and insulting and say there are three further obstacles to creating good, usable and shareable occurrence data:

Datasets are compiled as though they were family heirlooms.

The PIA says “This database is OUR property. It’s for OUR use and WE understand the data, even if it’s messy and outsiders can’t figure out what we’ve done. Ambiguities? No problem, we’ll just email Old Fred. He retired a few years back but he knows the system back to front.”

Prising data items from these heirlooms, mapping them to new fields and cleaning them are complicated exercises best left to data specialists. That’s not what happens.

Datasets are too often compiled by people with inadequate computer skills. Their last experience of data management was building a spreadsheet in a “digital learning” class. They’re following instructions but they don’t understand them. Both the data enterers and their instructors are hoping for a good result, which is truly courageous optimism.

The (often huge) skills gap between the compilers of digital PIA data and the computer-savvy people who analyse and reformat/repackage the data (users and facilitators-for-users) could be narrowed programmatically, but isn’t. Hands up all those who use a spreadsheet for data entry by volunteers and have comprehensive validation rules for each of the fields? Thought so.
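For the record, even a few lines of code can catch several of the problems listed above. A minimal sketch, assuming a CSV of Darwin Core records with decimalLatitude, decimalLongitude and eventDate fields:

import csv
from datetime import date

# Tokens that all mean "no data" (and shouldn't be mixed in one table)
MISSING = {"", "?", "na", "-", "unknown", "not recorded"}

with open("occurrences.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        lat = (row.get("decimalLatitude") or "").strip()
        lon = (row.get("decimalLongitude") or "").strip()
        # Paired fields: a latitude without a longitude (or vice versa) is an error
        if (lat in MISSING) != (lon in MISSING):
            print("unpaired coordinates:", row)
        # Dates should be consistent ISO 8601 (YYYY-MM-DD), not whatever the spreadsheet felt like
        event_date = (row.get("eventDate") or "").strip()
        if event_date not in MISSING:
            try:
                date.fromisoformat(event_date)
            except ValueError:
                print("bad eventDate:", event_date)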

People confuse software with data. This isn’t a problem restricted to biodiversity informatics, and I’ve ranted about this issue elsewhere. The effect is that data compilers blame software for data problems and don’t accept responsibility for stuff-ups.

Sometimes that blaming is justified. As a data auditor I dread getting an Excel file, because I know without looking that the file will have usability and shareability issues on top of the usual spreadsheet errors. Excel isn’t an endpoint in a data-use pipeline, it’s a starting point and a particularly awful one.

Another horror is the export option. Want to convert your database of occurrence records to format X? Just go to the “Save as” or “Export data” menu item and click “OK”. Magic happens and you don’t need to check the exported file in format X to see that all is well. If all is not well, it’s the software’s fault, right? Not your problem.

In view of these and the previously blogged-about explanations for bad data, it’s a wonder that there are any high-quality datasets, but there are. I’ve audited them and it’s a shame that for ethical reasons I can’t enter them myself in the current Darwin Core Million.

Wednesday, August 26, 2020

Personal knowledge graphs: Obsidian, Roam, Wikidata, and Xanadu


I stumbled across this tweet yesterday (no doubt when I should have been doing other things), and disappeared down a rabbit hole. Emerging, I think the trip was worth it.

 

Markdown wikis


Among the tools listed by @zackfan01 were Obsidian and Roam, neither of which I had heard of before. Both are pitched as "note-taking" apps, but they are essentially personal wikis where you write text in Markdown and use [[some text goes here]] to create links to other pages (very like a wiki). Both highlight backlinks, that is, each page clearly displays "what links here", making it easy to navigate around the graph you are creating by linking pages. Users of Obsidian share these graphs on Discord, rather like something from Martin MacInnes' novel "Gathering Evidence". Personal wikis have been around for a long time, but these apps are elegantly designed and seem fun to use. Looking at them I'm reminded of my earlier post Notes on collections, knowledge graphs, and Semantic Web browsers, where I moaned about the lack of personal knowledge graphs that supported inference over linked data. I'm also reminded of Blue Planet II, the BBC, and the Semantic Web: a tale of lessons forgotten and opportunities lost, where I constructed an interactive tool to navigate BBC data on species and their ecology (you can see this live at https://rdmpage.github.io/bbc-wildlife/www/), and the fun to be had from simply being able to navigate around a rich set of links. I imagine these Markdown-based wikis could be a great way to further explore these ideas.
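The core mechanic is simple enough that a backlink index for a folder of Markdown notes fits in a dozen lines. A sketch, assuming the notes live as .md files in a notes/ folder:

import re
from collections import defaultdict
from pathlib import Path

# Build a backlink index for Markdown notes that use [[wiki links]]
link_pattern = re.compile(r"\[\[([^\]]+)\]\]")
backlinks = defaultdict(set)

for note in Path("notes").glob("*.md"):
    for target in link_pattern.findall(note.read_text(encoding="utf-8")):
        backlinks[target].add(note.stem)

for target, sources in backlinks.items():
    print(target, "<-", ", ".join(sorted(sources)))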


 

Personal and global knowledge graphs


Then I began thinking: what if the [[page links]] in these personal knowledge graphs were not just text but, say, Wikidata identifiers (of the form "Qxxxxx")? Imagine that if you were writing notes on, say, a species, you could insert the Wikidata Qid and get a pre-populated template with some facts from Wikidata as a starting point (see for example Toby Hudson's Entity Explosion, which I discussed earlier). Knowing that more and more scholarly papers are being added to Wikidata, you could also add bibliographic citations as Qids, fetching the necessary bibliographic information on the fly from Wikidata. So your personal knowledge graph intersects with the global graph.
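Pre-populating a note from a Qid is already straightforward, since Wikidata will hand you the whole entity as JSON. A sketch (Q42 here is just a placeholder for whatever Qid you type into your note):

import requests

# Fetch basic facts for a Wikidata item to seed a note template.
qid = "Q42"  # placeholder: substitute the Qid inserted into the note
data = requests.get(f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json").json()
entity = data["entities"][qid]

label = entity["labels"]["en"]["value"]
description = entity["descriptions"]["en"]["value"]
print(f"# {label}\n\n{description}\n")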

 

Roam


Now, I've not used Roam, but anyone who has is likely to balk at my characterisation of it as "just" a Markdown wiki, because there's more going on here. The Roam white paper talks about making inferences from the text using reasoning or belief networks, although these features don't seem to have had much uptake. But what really struck me as I explored Roam was the notion of not just linking to pages using the [[ ]] syntax, but also linking to parts of pages (blocks) using (( )). In the demo of Roam there are various essays, such as Paul Graham's The Refragmentation, and each paragraph is an addressable block that can be cited independently of the entire essay. Likewise, you can see what pages cite that block.


Now in a sense these are just like fragment identifiers that we can use to link to parts of a web page, but there's something more here, because these fragments are not just locations in a bigger document; they are the components of the document.

Xanadu


This strikes me as rather like Ted Nelson's vision of Xanadu, where you could cite any text at any level of granularity, and that text would be incorporated into the document you were creating via transclusion (i.e., you don't include a copy of the text, you include the actual text). In the context of Roam, this means the entire text you want to cite is included in the system, so you can show chunks of it and build up a network of ideas around each chunk. It also means that the text being worked on becomes part of the system, rather than remaining isolated as, say, a PDF or other representation. This got me thinking about the Plazi project, where taxonomic papers are being broken into component chunks (e.g., figures, taxonomic descriptions, etc.) that are then stored in various places and reassembled - rather like Frankenstein's monster - in new ways, for example in GBIF (e.g., https://www.gbif.org/species/166240579) or Species-ID (see doi:10.3897/zookeys.90.1369). One thing I've always found a little jarring about this approach is that you lose the context of the work that each component was taken from. Yes, you can find a link to the original work and go there, but what if you could seamlessly click on a paragraph or figure and see it as part of the original article? Imagine we had all the taxonomic literature available in this way, so that we could cite any chunk, remix it (which is a key part of floras and other taxonomic monographs), but still retain the original context?

Summary


To come back full circle: in some ways tools like Obsidian and Roam are old hat. We've had wikis for a while, the idea of loading texts into wikis is old (e.g., Wikisource), backlinks are nothing new, and so on. But there's something about seeing clean, elegant interpretations of these ideas, free of syntax junk and accompanied by clear visions of how software can help us think. I'm not sure I will use either app, but they have given me a lot of food for thought.

Tuesday, August 25, 2020

Entity Explosion: bringing Wikidata to every website

A week ago Toby Hudson (@tobyhudson) released a very cool Chrome (and now Firefox) extension called Entity Explosion. If you install the extension, you get a little button you can press to find out what Wikidata knows about the entity on the web page you are looking at. The extension works on web sites that have URLs that match identifiers in Wikidata. For example, here it is showing some details for an article in BioStor (https://biostor.org/reference/261148). The extension "knows" that this article is about the Polish arachnologist Wojciech Staręga.



But this is a tame example; see what fun Dario Taraborelli (@ReaderMeter) is having with Toby's extension:

There are some limitations. For instance, it requires that the web site URL matches the identifier, or more precisely the URL formatter for that identifier. In the case of BioStor the URL formatter URL is https://biostor.org/reference/$1 where $1 is the BioStor identifier stored by Wikidata (e.g., 261148). So, if you visit https://biostor.org/reference/261148 the extension works as advertised.

However, identifiers that redirect to other web sites, such as DOIs, aren't so lucky. A Wikidata item with a DOI (such as 10.1371/JOURNAL.PONE.0133602) corresponds to the URL https://doi.org/10.1371/JOURNAL.PONE.0133602, but if you click on that URL you eventually get taken to https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0133602, which isn't the original DOI URL (incidentally, this is exactly how DOIs are supposed to work).

So, it would be nice if Entity Explosion would also read the HTML of the web page and attempt to extract the DOI from it (for notes on this see https://www.wikidata.org/wiki/Wikidata_talk:Entity_Explosion#Retrieve_identifiers_from_HTML_<meta>_tag), which would mean it worked on even more web sites for academic articles.
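Most publisher pages already expose the DOI in standard <meta> tags, so the extraction step is straightforward. A sketch assuming requests and BeautifulSoup (which tag a given publisher uses varies):

import requests
from bs4 import BeautifulSoup

# Look for a DOI in the usual <meta> tags (Highwire / Dublin Core style).
url = "https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0133602"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

for name in ("citation_doi", "dc.identifier", "DC.identifier", "prism.doi"):
    tag = soup.find("meta", attrs={"name": name})
    if tag and tag.get("content"):
        print(name, "->", tag["content"])
        break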

Meantime, if you use Chrome or Firefox as your browser, grab a copy and discover just how much information Wikidata has to offer.

Workshop On Open Citations And Open Scholarly Metadata 2020 talk

I'm giving a short talk at the Workshop On Open Citations And Open Scholarly Metadata 2020, which will be held online on September 9th. In the talk I touch on citation patterns in the taxonomic literature, the recent Zootaxa impact factor story, and a few projects I'm working on. To create the presentation I played around with mmhmm, which (pretty obviously) I still need to get the hang of...

Anyway, video below: