Thursday, March 06, 2008

Word for the day - "transclusion"

Stumbled across Project Xanadu, Ted Nelson's vision of the way the web should be (e.g., BACK TO THE FUTURE: Hypertext the Way It Used To Be). Nelson coined the term "transclusion", including one document in side another by reference. The screen shot of Xanadu Space may help illustrate the idea:

Nelson envisages a web where instead of just one-way links, documents include parts of other documents, and one can view a document side-by-side with the source documents. Modern web browsers transclude images (the image file is not "physically" in the document, rather it exists elsewhere), but mostly they link to other documents via hyperlinks.
Ted Nelson's writings are a fascinating read, partly because they remind you just how much of the web we take for granted, and how thinks could be different (better?). One thing he objects to is that much of the the web simulates paper
Much of the field has imitated paper: in word processing (Microsoft Word and Adobe Acrobat) and the World Wide Web, whose rectangular page layouts become a focal issue. It should be noted that these systems imitate paper under glass, since you can't annotate it.
Nelson also advocates every element of a document having its own unique address, not just at book or article level. This resonates with what is happing with digital libraries. Gregory Crane in his article "What Do You Do with a Million Books?" (doi:10.1045/march2006-crane) notes that:
Most digital libraries still mimic their print predecessors, treating individual objects – commonly chunks of PDF, RTF/Word, or HTML with no standard internal structure – as its constituent units. As digital libraries mature and become better able to extract information (e.g., personal and place names), each word and automatically identifiable chunk of words becomes a discrete object. In a sample 300 volume, 55 million word collection of nineteenth-century American English, automatic named entity identification has added 12,000,000 tags. While this collection focuses on name rich historical materials and includes several reference works, this system already discovers thousands of references to named entities in most book length documents. We thus move from single catalogue entries with a few hundred words to thousands of tagged objects – an increase of at least one order of magnitude with named entities and of at least two orders of magnitude when we consider each individual word as an object.
I discovered Crane's paper via Chris Freeland's post On Name Finding in the BHL. Chris summarises BHL's work on scanning biodiversity literature and extracting taxonomic names. BHL's output is at the level of pages, rather than articles. Existing GUIDs for literature (such as DOIs and SICIs) typically identify articles rather than pages (or page elements), so there's a need to extending these to pages.

Chris also raises the issue of ranking and relevance -- "What do you do with 19,000 pages containing Hymenoptera?". One possibility is to explore Robert Huber's TaxonRank idea (inspired by Google's PageRank). This would require text mining to build synonomy lists from scanned papers, challenging but not impossible. But I suspect that the network of citations is what will help build a sensible way to rank the 19,000 pages.

A while ago people were speculating what Google could do to help biodiversity informatics. I found much of this discussion to be vague, with no clear notion of what Google could actually do. What I think Google is exceptionally good at are two things we need to tackle -- text mining, and extracting information from links. I think this is where BHL and, by extension, EOL, should be devoting much of their resources.

2 comments:

Chris Freeland said...

Rod - Thanks for posting the link to TaxonRank. I hadn't heard about it, but it does certainly look promising to address the problems I've mentioned with BHL.

Anonymous said...

There are so many companies for scannig the documents. But I found a good image scanning documents of all types and sizes to all standards including, TIFF, JPEG, High Compression PDF, Black & White, Grayscale and true Color.