Monday, July 15, 2019

Notes on collections, knowledge graphs, and Semantic Web browsers

While working with linked data and ways to explore and visualise information, I keep coming back to the Haystack project, which is now over a decade old. Among the tools developed was the Haystack application, which enabled a user to explore all sorts of structured data. Below is a screenshot of Haystack showing a sequence for Homo sapiens cyclin T1 (CCNT1), transcript variant a, mRNA. Note the use of an LSID to identify the sequence (LSIDs were actively being used to identify bioinformatics resources): urn:lsid:ncbi.nlm.nih.gov.lsid.i3c.org:genbank:nm_001240.



For some background on the Haystack project see How to Make a Semantic Web Browser DOI:10.1145/988672.988707 (PDF) and Haystack: A Customizable General-Purpose Information Management Tool for End Users of Semistructured Data (PDF).
One reason I keep coming back to the Haystack project is the notion of having a personal space for exploring linked data. One of the challenges of having a large knowledge graph is that it becomes hard to have "local" queries. That is, queries which are restricted to a subset of things that you care about.

For example, while playing around with Ozymandias I keep coming across interesting species, such as Milyeringa justitia (see FIGURE 5 in A new species of the blind cave gudgeon Milyeringa (Pisces: Gobioidei, Eleotridae) from Barrow Island, Western Australia, with a redescription of M. veritas Whitley).


If I want to explore this taxon in more detail I'd like to have the original description, any relevant DNA sequences (e.g., MG543430), any papers publishing those sequences (e.g., Multiple molecular markers reinforce the systematic framework of unique Australian cave fishes (Milyeringa : Gobioidei)), and phylogenetic analyses such as the paper The First Record of a Trans-Oceanic Sister-Group Relationship between Obligate Vertebrate Troglobites which establishes a link between Milyeringa and a genus of cave fish endemic to Madagascar (Typhleotris).

What I'd like to be able to do is collect all these sources (ideally by simply bookmarking the links), save them as a "collection", then at some point explore what the knowledge graph can tell me. The importance of having a collection is that I can tell the knowledge graph that I just want to explore a subset of information. Without a collection it can be tricky to limit the scope of queries. For example, given a global knowledge graph such as Wikidata, how would you query just species found in Australia? You would typically rely on the species having either a property ("found in Australia"), or perhaps an identifier that is only used for Australian species. Neither of these is particularly satisfactory, especially if there isn't a property that fortuitously matches the scope of your inquiry.
Hence, I'm interested in having collections: lists of entities that I want to know more about. I need ways to create these collections, ways to describe them, and ways to explore them. In some ways the collections feature of EOL was close to what I'm after. In the previous version of EOL you could "collect" taxa that you were interested in (for example, species that were blue) (see I think I now "get" the Encyclopedia of Life). Sadly, collections (along with JSON-LD export and stable image URLs) have vanished from the new EOL (which seems to be in a death spiral driven by some really unfortunate decisions). And collections need to be able to contain any entity, not just taxa.
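The core idea can be sketched in a few lines: a collection is just a set of entity identifiers, and a "local" query is a query over the global graph restricted to members of that set. A minimal sketch in Python (the IRIs and the toy graph are invented for illustration):

```python
# A toy "global knowledge graph" as (subject, predicate, object) triples.
# All identifiers here are made up for illustration.
GRAPH = {
    ("taxon:Milyeringa_justitia", "rdf:type", "dwc:Taxon"),
    ("taxon:Milyeringa_justitia", "dwc:scientificName", "Milyeringa justitia"),
    ("seq:MG543430", "rdf:type", "so:Sequence"),
    ("seq:MG543430", "obo:associated_with", "taxon:Milyeringa_justitia"),
    ("taxon:Typhleotris", "rdf:type", "dwc:Taxon"),
}

# The collection: entities I have "bookmarked" while reading.
blind_fish = {"taxon:Milyeringa_justitia", "seq:MG543430"}

def scoped(graph, collection):
    """Return only the triples whose subject is a member of the collection."""
    return {(s, p, o) for (s, p, o) in graph if s in collection}

local = scoped(GRAPH, blind_fish)
# Typhleotris is in the global graph but not in my collection,
# so it is absent from the scoped view.
```

In a real triple store this filtering would be done in the query itself (e.g. with VALUES or a named graph in SPARQL), but the principle is the same: membership of the collection defines the scope.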

One way to represent collections in the linked data world is using RSS feeds, or their schema.org descendant, the DataFeed (see also Google's Data Feed Validation Tool). So, we could collect a series of things we are interested in, create the corresponding DataFeed, import that into our Knowledge Graph and that would give us a way to scope our queries (using membership of the DataFeed to select the species, papers, sequences, etc. that we are interested in). As an aside, there's also some overlap with another MIT project of old, David Huynh's Parallax project which explored querying on a set of objects, rather than one object at a time. This is the functionality that a collection gives you (if you have a query language like SPARQL which can work on sets of things).
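To make this concrete, here is a hedged sketch of what such a collection might look like as a schema.org DataFeed serialised in JSON-LD. The feed name and items are invented for illustration, and schema.org's Taxon type is still a pending type, so check the current vocabulary before relying on it:

```python
import json

# A collection expressed as a schema.org DataFeed in JSON-LD.
# Each bookmarked entity becomes a DataFeedItem wrapping the item itself.
feed = {
    "@context": "https://schema.org",
    "@type": "DataFeed",
    "name": "Blind cave fishes",  # hypothetical collection name
    "dataFeedElement": [
        {
            "@type": "DataFeedItem",
            "item": {
                "@type": "Taxon",  # a schema.org pending type
                "name": "Milyeringa justitia",
            },
        },
        {
            "@type": "DataFeedItem",
            "item": {
                "@type": "ScholarlyArticle",
                "name": "Multiple molecular markers reinforce the systematic "
                        "framework of unique Australian cave fishes "
                        "(Milyeringa : Gobioidei)",
            },
        },
    ],
}

doc = json.dumps(feed, indent=2)
```

Because it is JSON-LD, this document can be imported straight into a triple store, at which point the dataFeedElement links give us the membership relation we need to scope queries.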

Returning to Haystack, I'm intrigued by the idea of building a personal linked data browser. In other words, a browser that stores data that is relevant to projects you are working on (e.g., blind fish) as collections (data feeds), but can query a global knowledge graph to augment that information. SPARQL supports federated queries, so this is eminently doable. The local browser would have its own triple store, which could be implemented using Linked Data Fragments.
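The shape of the federated query such a browser might issue can be sketched as follows. This is a sketch only: the local named graph and the use of schema.org terms for the feed are assumptions, and Wikidata is just one example of a remote endpoint the SERVICE clause could point at:

```python
# Build a SPARQL 1.1 federated query: collection members come from a
# local named graph, and a remote endpoint augments them via SERVICE.
def federated_query(collection_graph, remote_endpoint):
    return f"""
PREFIX schema: <http://schema.org/>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?item ?remoteLabel WHERE {{
  GRAPH <{collection_graph}> {{
    ?feed a schema:DataFeed ;
          schema:dataFeedElement/schema:item ?item .
  }}
  SERVICE <{remote_endpoint}> {{
    ?item rdfs:label ?remoteLabel .
  }}
}}
"""

q = federated_query(
    "urn:example:blind-fish",               # hypothetical local named graph
    "https://query.wikidata.org/sparql",    # example remote endpoint
)
```

The local store answers the GRAPH pattern cheaply (this is where Linked Data Fragments could serve the local data), while the SERVICE block pulls in whatever the global knowledge graph knows about each collection member.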

For now this is just a jumble of poorly articulated ideas, but I think much of the power of linking data together will be lost until we have simple tools that enable us to explore the data in ways that are relevant to what we actually want to know. Haystack gives us one model of what such a tool could look like.