iPhylo: May 2017

Roderic D. M. Page

Wednesday, May 31, 2017

Programming with Glitch: microservices and serverless computing

Yes, this post is indeed an attempt to fit as many buzzwords that I don't really understand as I can into the title. I've been playing around with Glitch, which is a delightful project from Fog Creek (makers of Trello and co-creators of Stack Overflow).

At first glance Glitch looks weirdly retro, and it took a little while for me to get the hang of things. But it's fun and very powerful. Basically it's a place where you can start creating web apps in your browser, and each app is automatically hosted online. If you see an app that you like you can see the source code (just like you can see HTML using "view source" in your browser). If you want to hack on the code you can simply create a copy and it's yours to play with (this is called "remixing", like forking on GitHub). Your copy gets a cute name (possibly annoyingly cute) and away you go.

If you're a developer, then at this point you're probably wondering what is actually happening under the hood. Each Glitch app is a node.js app, which means you're programming in Javascript (you can just use HTML and client side Javascript if you want to avoid node.js). I'm very new to node.js, so Glitch has been a fun way to experiment.

There are two things which make Glitch very powerful. The first is the "remix" feature. Don't know where to start? Find an app that looks like it might do something you want to do, remix it, and hack away. The code is edited online, and the editor works very well. It also checks your code for Javascript errors as you type, which is helpful (usually).

The second great feature is that you get built-in hosting for free. As soon as you remix an app you have a functioning web site. Remixing is very like forking in GitHub, and if you're running node.js on your local machine then the benefits of Glitch might not seem obvious. But hosting is often a pain: either you need to set up your own servers, or use a hosting service. Glitch takes care of this for you, so your app is instantly available for others to use.

So, what can you do with Glitch? There are some great examples on the Glitch site, but I want to show an almost trivial example. I've created an app called "enchanting-bongo" https://enchanting-bongo.glitch.me (yes, the name is a bit irritating) that does one simple thing. You give it a DOI for an article and enchanting-bongo tells you whether any of the authors of that work have an ORCID. For example, try the DOI 10.3897/zookeys.555.6173. Why did I write this? I'm interested in ways to link people to the work that they've done, especially work that ends up being aggregated in large-scale biodiversity databases like GBIF (see Possible project: #itaxonomist, combining taxonomic names, DOIs, and ORCID to measure taxonomic impact).

The app does one thing. It takes the DOI and calls the ORCID API to see if anyone has claimed authorship of the paper with that DOI. You can use the app with a web browser, or you can use an HTTP client and call the API (e.g., https://enchanting-bongo.glitch.me/search?q=10.3897%2Fzookeys.555.6173).
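As a minimal sketch of calling the API from code, the snippet below builds the request URL for a DOI. The endpoint and the `q` parameter come from the example URL above; the function name `buildSearchUrl` is my own, and since the shape of the response isn't described here the sketch stops at constructing the URL.

```javascript
// Base URL of the app's search endpoint, taken from the example in the post.
const SEARCH_ENDPOINT = "https://enchanting-bongo.glitch.me/search";

// Build the request URL for a DOI (hypothetical helper name).
// The slash in a DOI has to be percent-encoded when the DOI is passed
// as the value of a query parameter.
function buildSearchUrl(doi) {
  return SEARCH_ENDPOINT + "?q=" + encodeURIComponent(doi.trim());
}

console.log(buildSearchUrl("10.3897/zookeys.555.6173"));
// https://enchanting-bongo.glitch.me/search?q=10.3897%2Fzookeys.555.6173
```

Any HTTP client can then fetch the resulting URL.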

Glitch is an example of serverless computing, where you don't have to worry about physical servers or the software infrastructure that runs on them (e.g., the web server itself); you just write code. Like any buzzword, there is some pushback (see, for example, What Is “Serverless”? An Alternative Take), but for a fascinating essay I recommend Why the fuss about serverless?. But the notion that I can simply hack away on some code and have an instantly available web app is very attractive.

The other buzzword is "microservices". I'm forever needing to do tasks such as find a DOI for a paper, match a "microcitation" to the enclosing article, locate a specimen in GBIF based on a catalogue number in a paper, or parse some text (a reference, geographic coordinates, etc.) into structured data. These are tools that I need in lots of contexts, and I've written software to do this on my machine, often as part of larger projects. "Microservices" is the idea that instead of large, monolithic apps we write a series of minimal tools that typically do one thing, and do it well. We then chain them together to do various tasks. Having small tools means that we can treat each problem independently, and if the tools communicate over the web (HTTP) then it doesn't matter what programming language we use. I've started thinking more and more about adopting this model and developing a bunch of small services to perform many of the tasks I need. Hosting these services then becomes an issue. I have web servers in my office but they are a pain to maintain (my university is forever insisting that I upgrade their software), so cloud-based hosting seems the obvious way forward. Free hosting looks ideal, so Glitch is looking very attractive.

So, I'm hoping to experiment more with this approach. One thing I might do is create a series of services very like enchanting-bongo, each with a simple web interface and an API that the web interface calls. That way users can play with a service in their web browser, then call it via the API if it does something useful. As more sophisticated examples, I'm working on tools to parse Wikispecies reference strings, and to link specimen codes to records in GBIF.
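To make the "web interface plus API" pattern concrete, here is a rough sketch using only node's built-in `http` module. It is not the code behind enchanting-bongo: the `lookup` function is a hypothetical placeholder that just echoes the query, and the routes and the use of `process.env.PORT` are assumptions for illustration.

```javascript
const http = require("http");

// Hypothetical placeholder for the real work (e.g., calling another API).
// Here it simply echoes the query back.
function lookup(query) {
  return { query: query, note: "placeholder result" };
}

// One handler serves both the web page and the JSON API.
function handleRequest(req, res) {
  const url = new URL(req.url, "http://localhost");
  if (url.pathname === "/search" && url.searchParams.has("q")) {
    // API: /search?q=... returns JSON
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify(lookup(url.searchParams.get("q"))));
  } else if (url.pathname === "/") {
    // Web interface: a form that sends its input to the API
    res.writeHead(200, { "Content-Type": "text/html" });
    res.end('<form action="/search"><input name="q"><button>Search</button></form>');
  } else {
    res.writeHead(404, { "Content-Type": "text/plain" });
    res.end("Not found");
  }
}

const server = http.createServer(handleRequest);
// Uncomment to start listening (the port setting is an assumption):
// server.listen(process.env.PORT || 3000);
```

Because the page and the API share one small handler, anyone who finds the page useful can call `/search?q=...` directly from their own code.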

One reason I'm enthusiastic about Glitch is that it is fun! Some of the best shifts in technology that I've made have been because a tool made something easy and fun to do. For example, CouchDB made working with structured data fun, and that was a revelation (databases, fun, surely not). Fun is a much neglected characteristic of the tools we use.

Querying Wikidata

Over 7 million #SPARQL queries/day in @wikidata #WikiCite 👏🏻 pic.twitter.com/l2I6IcnGJj
— WikiCite (@Wikicite) May 23, 2017

For my own use more than anything else I've started creating a list of Wikidata SPARQL queries here. I personally don't find Wikidata's data model particularly easy to grasp, so one way to learn is to take the example queries on the Wikidata Query site and mess about with them.

For those interested in taxonomic data, Wikidata is quite rich in content. For example, you can find the author of a taxonomic name, or find the taxon names an author is responsible for creating.

It is also fairly straightforward to search for content by identifier, e.g.

SELECT *
WHERE
{
  ?work wdt:P356 "10.2476/ASJAA.62.33" .
}

will find the article with the DOI 10.2476/ASJAA.62.33. One minor gotcha is that Wikidata has all DOIs in UPPERCASE, so you either need to search for the uppercase version of the DOI, or use a filter to convert the case, which is slow.
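One way to sidestep the case problem is to uppercase the DOI before it goes into the query. The sketch below does this in JavaScript; the function name `buildDoiQuery` is my own, and it only builds the query text rather than sending it anywhere.

```javascript
// Build a SPARQL query that looks up a work by DOI (hypothetical helper name).
// Wikidata stores DOIs in uppercase, so normalise the DOI first.
function buildDoiQuery(doi) {
  const normalised = doi.trim().toUpperCase();
  // Escape backslashes and double quotes so the DOI stays a valid
  // SPARQL string literal.
  const escaped = normalised.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
  return [
    "SELECT *",
    "WHERE",
    "{",
    `  ?work wdt:P356 "${escaped}" .`,
    "}"
  ].join("\n");
}

console.log(buildDoiQuery("10.2476/asjaa.62.33"));
```

The output is the same query as above, whatever case the DOI was typed in.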

As I come across interesting or useful queries I'll add them to the list in GitHub.

Wikidata, WikiCite, and the "bibliography of life"

Last week I was at WikiCite 2017, a fascinating three-day event in Vienna. Wikicite is "a proposal to build a bibliographic database in Wikidata to serve all Wikimedia projects", and is attracting increasing attention from academics, librarians, publishers, data geeks, and others. You can get a sense of the project by following @WikiCite on Twitter.

I went to the meeting in part to learn more about WikiCite, and also to spend some time hacking on Wikispecies. I'd been to only one Wiki event before (a Wiki Science Conference) so I'm still finding my way around this community. I spent the first two days listening to talks while coding away (more on this below), but on Wednesday put my own coding aside to join a bunch of people hacking the CrossRef event API in a great session led by Joe Wass. I've put some notes and code in GitHub. The event API tracks what people do with DOIs, including adding them to Wikipedia pages when citing a source in support of an assertion. A significant fraction of DOI resolutions are from Wikipedia pages, which is one reason why CrossRef was present at WikiCite.

Wikidata

In practice WikiCite's goal of building a bibliographic database to serve all Wikimedia projects means that articles, books, and other bibliographic items that are cited by Wikimedia projects will each be added to Wikidata. For example, the ZooKeys paper "Diversity of Manota Williston (Diptera, Mycetophilidae) in Ulu Temburong National Park, Brunei" is item Q21188431 in Wikidata. Wikidata stores the key bibliographic metadata, including identifiers such as the DOI (which many at the WikiCite meeting pronounced "doy", much to my initial confusion).

This article was published in ZooKeys, which itself has a Wikidata item (Q219980), so in Wikidata the article is linked to the journal (i.e., "ZooKeys" isn't just a dumb string but a link to another Wikidata item). The article is also linked to two articles that it cites, and each of these is also a Wikidata item.
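These links can be followed with a query. The sketch below builds one that asks for the journal an item was published in and any works it cites. The item ID comes from the text above; the property IDs P1433 ("published in") and P2860 ("cites work") are the ones I believe Wikidata uses, so check them before relying on this. The function name is my own.

```javascript
// Item ID for the ZooKeys article, as given in the post.
const ARTICLE = "Q21188431";

// Build a SPARQL query for an item's journal and cited works.
// P1433 ("published in") and P2860 ("cites work") are assumed property IDs.
function buildArticleLinksQuery(itemId) {
  // Guard against anything that is not a plain item ID such as Q42.
  if (!/^Q[1-9][0-9]*$/.test(itemId)) {
    throw new Error("Not a Wikidata item ID: " + itemId);
  }
  return [
    "SELECT ?journal ?cited",
    "WHERE",
    "{",
    `  wd:${itemId} wdt:P1433 ?journal .`,
    `  OPTIONAL { wd:${itemId} wdt:P2860 ?cited . }`,
    "}"
  ].join("\n");
}

console.log(buildArticleLinksQuery(ARTICLE));
```

The values bound to `?journal` and `?cited` are themselves Wikidata items, which is the point: the links are to things, not strings.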

These citation links are one reason people are interested in WikiCite - it could be the basis of a free and open citation graph (for the benefits of such a graph see this piece by David Shotton doi:10.1038/502295a, a participant at the meeting in Vienna). Already some cool tools are being built on top of citation data in Wikidata, such as Scholia by Finn Årup Nielsen, Daniel Mietchen and Egon Willighagen. Here, for example, is my academic profile based on information in Wikidata. It's woefully incomplete, but intriguing. For a more complete example view Egon Willighagen's profile.

To some extent the utility of tools like Scholia will depend on how complete Wikidata's coverage is of the academic literature, which in turn raises the inevitable question of scope. Does Wikicite want to include just the literature cited in the various Wikimedia projects, or does it want to expand to include the total sum of academic literature?

Wikispecies, Wikidata and the bibliography of life

Wikispecies is one of the Wikimedia projects, and the only one that is topic-specific (the others are typically global in scope but have content in different languages, or host different data types such as images, scanned books, or structured data). As I've sketched out in an earlier post (Thoughts on Wikipedia, Wikidata, and the Biodiversity Heritage Library) I think Wikicite and Wikidata are potentially very important to projects such as BHL and the "bibliography of life". Much of our knowledge about the world's biodiversity is contained in the academic literature, and much of this is poorly known with no central database where we can find it, and much of it is still not digitised. It is tempting to think that Wikidata might be a platform around which the biodiversity community could focus its efforts on assembling a global database of biodiversity literature. Already major taxonomic journals such as ZooKeys are being fed into Wikidata, so it has a significant corpus of biodiversity literature already.

One way to grow this corpus is to focus on Wikispecies. In a post before the Wikicite meeting (Notes for WikiCite 2017: Wikispecies reference parsing) I elaborated on this idea. There are two stumbling blocks, one specific to Wikispecies, one a more general Wikidata issue.

The first issue is that Wikispecies bibliographic data is relatively unstructured, which makes converting it into structured data something of a challenge. I spent much of Wikicite hacking some code to do this on Glitch (more on Glitch later); you can see the results here: https://acoustic-bandana.glitch.me. This web site takes a Wikispecies reference and tries to convert it into CSL-JSON. It is still very much a work in progress, but I've started building tools that use this web site as a service to process larger numbers of Wikispecies citations.

The second issue is how you get data into Wikidata, and this is something that's never been entirely clear to me. There are tools for adding an article using its DOI (sourcemd) but this isn't scalable, and doesn't handle the case of articles that don't have DOIs. This is still a "How do you Snapchat? You just Snapchat" moment. Wikidata desperately needs tools and a clear procedure whereby people like me with lots of bibliographic data can contribute.

Wikispecies

Another reason for my interest in Wikispecies (and other sources of bibliographic data such as the lists of cited literature being made available by CrossRef, see The Initiative for Open Citations) is that this data can be fed into BHL to locate more articles in that archive. Once these articles have been located they are stored in BioStor and BHL itself, but it makes sense to have them more accessible, and Wikidata looks to be an obvious candidate. Given that Wikispecies is essentially a crowd-sourced taxonomic database there is considerable overlap in content between Wikispecies and BHL. The Wikidata data model also allows for some of the things that taxonomists care about, such as linking dates of publication to evidence relative to those dates (in older publications determining the publication date often requires quite extensive research).

Summary

Leaving aside the specific issues about how to get bibliographic data into Wikidata, I guess the question to ask is whether it makes sense to be developing large databases of bibliographic data without either using Wikidata as the platform to hold that data, or at least linking to Wikidata. Projects such as Gene Wiki are migrating from Wikipedia to Wikidata (see "Wikidata as a semantic framework for the Gene Wiki initiative" doi:10.1093/database/baw015). Perhaps those of us interested in biodiversity literature could use projects like Gene Wiki as role models for how we could both contribute to and benefit from Wikidata and Wikicite.

I've barely scratched the surface of what was discussed at Wikicite; for more details see the program. It is a very different sort of meeting in that the participants come from pretty diverse backgrounds, which helps shake up your own assumptions about what matters and how things should be done. It's also great that it's a meeting at which people write code or otherwise hack stuff together, so things actually get done. I've come away with lots to think about, and renewed enthusiasm about the role Wikimedia is playing in structuring our knowledge about the world.