Thursday, March 06, 2008

Word for the day - "transclusion"

Stumbled across Project Xanadu, Ted Nelson's vision of the way the web should be (e.g., BACK TO THE FUTURE: Hypertext the Way It Used To Be). Nelson coined the term "transclusion", meaning the inclusion of one document inside another by reference. The screen shot of Xanadu Space may help illustrate the idea:

Nelson envisages a web where instead of just one-way links, documents include parts of other documents, and one can view a document side-by-side with the source documents. Modern web browsers transclude images (the image file is not "physically" in the document, rather it exists elsewhere), but mostly they link to other documents via hyperlinks.
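
To make the idea of inclusion by reference concrete, here is a minimal sketch in Python. It is not how Xanadu works; the `Span` class, the `SOURCES` store, and the `span_of` and `render` helpers are all hypothetical names invented for illustration. The point is only that the composite document stores an address (document id plus character range) rather than a copy of the quoted text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A reference to characters [start, end) of a source document."""
    doc_id: str
    start: int
    end: int


# Hypothetical in-memory store of source documents, keyed by an id.
SOURCES = {
    "nelson": "Much of the field has imitated paper.",
}


def span_of(sources, doc_id, phrase):
    """Build a Span addressing the first occurrence of phrase in a source."""
    start = sources[doc_id].index(phrase)
    return Span(doc_id, start, start + len(phrase))


def render(parts, sources):
    """Resolve a document made of literal strings and Span references."""
    out = []
    for part in parts:
        if isinstance(part, Span):
            # Transcluded content is fetched from the source at render time.
            out.append(sources[part.doc_id][part.start:part.end])
        else:
            out.append(part)
    return "".join(out)


# The essay holds a reference to the quoted words, not a copy of them.
essay = [
    "Nelson complains that the field has ",
    span_of(SOURCES, "nelson", "imitated paper"),
    ".",
]
print(render(essay, SOURCES))
```

Because the essay keeps the address of the quoted words, a viewer could in principle follow the `Span` back to its source and show the two documents side by side.
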
Ted Nelson's writings are a fascinating read, partly because they remind you just how much of the web we take for granted, and how things could be different (better?). One thing he objects to is that much of the web simulates paper:
Much of the field has imitated paper: in word processing (Microsoft Word and Adobe Acrobat) and the World Wide Web, whose rectangular page layouts become a focal issue. It should be noted that these systems imitate paper under glass, since you can't annotate it.
Nelson also advocates every element of a document having its own unique address, not just at book or article level. This resonates with what is happening with digital libraries. Gregory Crane in his article "What Do You Do with a Million Books?" (doi:10.1045/march2006-crane) notes that:
Most digital libraries still mimic their print predecessors, treating individual objects – commonly chunks of PDF, RTF/Word, or HTML with no standard internal structure – as its constituent units. As digital libraries mature and become better able to extract information (e.g., personal and place names), each word and automatically identifiable chunk of words becomes a discrete object. In a sample 300 volume, 55 million word collection of nineteenth-century American English, automatic named entity identification has added 12,000,000 tags. While this collection focuses on name rich historical materials and includes several reference works, this system already discovers thousands of references to named entities in most book length documents. We thus move from single catalogue entries with a few hundred words to thousands of tagged objects – an increase of at least one order of magnitude with named entities and of at least two orders of magnitude when we consider each individual word as an object.
I discovered Crane's paper via Chris Freeland's post On Name Finding in the BHL. Chris summarises BHL's work on scanning biodiversity literature and extracting taxonomic names. BHL's output is at the level of pages, rather than articles. Existing GUIDs for literature (such as DOIs and SICIs) typically identify articles rather than pages (or page elements), so there's a need to extend these to pages.

Chris also raises the issue of ranking and relevance -- "What do you do with 19,000 pages containing Hymenoptera?". One possibility is to explore Robert Huber's TaxonRank idea (inspired by Google's PageRank). This would require text mining to build synonymy lists from scanned papers, which is challenging but not impossible. But I suspect that the network of citations is what will help build a sensible way to rank the 19,000 pages.
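
As a rough illustration of ranking by citation links, here is a sketch of plain PageRank computed by power iteration over a toy citation graph. This is the generic algorithm, not TaxonRank and not anything BHL has implemented; the page identifiers and the `pagerank` function are made up for the example, and the damping factor of 0.85 is just the commonly quoted default.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain PageRank by power iteration.

    links maps each node to the list of nodes it cites.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly among the pages it cites.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page citing nothing spreads its rank over all pages.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical citation graph: three pages cite page_c, which cites page_a.
citations = {
    "page_a": ["page_c"],
    "page_b": ["page_c"],
    "page_c": ["page_a"],
    "page_d": ["page_c"],
}
ranks = pagerank(citations)
ranked = sorted(ranks, key=ranks.get, reverse=True)
print(ranked[:2])
```

In this toy graph the heavily cited page comes out on top, followed by the page it cites, which is the behaviour one would hope for when sorting thousands of pages that mention the same name.
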

A while ago people were speculating what Google could do to help biodiversity informatics. I found much of this discussion to be vague, with no clear notion of what Google could actually do. What I think Google is exceptionally good at are two things we need to tackle -- text mining, and extracting information from links. I think this is where BHL and, by extension, EOL, should be devoting much of their resources.

Comments:

Chris Freeland said...

Rod - Thanks for posting the link to TaxonRank. I hadn't heard about it, but it does certainly look promising to address the problems I've mentioned with BHL.
