Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed. ISSN 2051-8188. Written content on this site is licensed under a Creative Commons Attribution 4.0 International license.
Modelling taxa is a bit trickier. I've sketched my ideas for distinguishing name strings and taxonomic names earlier. That's the easy stuff. What about "taxonomic concepts" and "OTUs"? As a first pass, I'm looking at linking taxon names to classifications via GUIDs. If a taxon appears in a classification then the GUID of the corresponding node in the classification is an attribute of the taxon name, and each classification GUID (representing a node in a classification) corresponds to a page in the Wiki.
The trick here is going to be ensuring that I can do sensible queries, such as linking a node in a classification to alternative names.
The other entities that I need to think carefully about are OTUs (Operational Taxonomic Units). By OTUs I mean the taxa that appear in phylogenetic trees. In the TbMap project I mapped TreeBASE taxa to names in external databases, but noted that TreeBASE taxa are better thought of as OTUs:
...many taxon names in TreeBASE are best thought of as Operational Taxonomy Units (OTUs) rather than taxonomic names. They identify a set of observations for a particular specimen, set of specimens, or a taxon. For instance, "Eleutherodactylus crassidigitus FMNH257676 Panama" (TaxonID T51971) refers to a 1200 base pair stretch of mitochondrial DNA (AY273113) obtained from Field Museum Natural History specimen FMNH 257676, which has been identified as Eleutherodactylus crassidigitus. [see doi:10.1186/1471-2105-8-158].
Taxa in phylogenetic trees may be single sequences, multiple sequences (from one or more specimens), or aggregates of information from multiple taxa. The challenge is to model these in the simplest way that reflects this, but also makes queries feasible. What I'm aiming for is for the user to click on a node in a phylogeny, and be taken to a page that best corresponds to the entity in the tree, but at the same time enable queries that will list all phylogenies that contain a given taxon.
Time to make some notes. I've been playing with using Semantic Mediawiki to create a database of taxonomic names, literature, specimens, sequences, and phylogenies. One challenge is to come up with simple ways to model these entities, in a way that makes both data entry and querying as simple as possible. Some things are straightforward. For example, a publication can be modelled like this: OK, I've ignored the attributes. The diagram simply shows the use of MediaWiki REDIRECT to enable the use of standard publication GUIDs as Wiki page names (see earlier posts for more details, and a hack to deal with problem characters in DOIs). One benefit of GUID REDIRECTs is that I can refer to publications using GUIDs, and the wiki user will be taken to the article page without any fuss.
Likewise, we can model a journal like this: Again, GUIDs are REDIRECT pages. This means an article page can have the ISSN of the publication it appears in as one of its attributes, and we can then use ISSNs in our queries.
People are a bit trickier, given the absence of GUIDs (or the desire to keep obvious ones, such as email addresses, private) (see doi:10.1371/journal.pcbi.1000247 for some background). I plan to have a single page for each author, and have alternative spellings link to that page:
This is one motivation for my work on equivalent author names. By finding clusters of equivalent names it would be possible to pre-populate the wiki with author names from bibliographic databases, whilst minimising the number of duplicate pages for the same author.
In conjunction with the TV show, the Wellcome Trust has launched the Interactive Tree of Life, a Flash-based view of the tree of life. There's also a blog about the project. Here's a demo of the tree:
The tree looks very nice, and a lot of work has gone into it, but I am somewhat underwhelmed. The tree itself is tiny, and does a poor job of conveying the relative diversity of life (e.g., no plants, bacteria, few arthropods, etc.). It displays the tree on a 2D plane, and the user can move relative to that plane. I'm not convinced this is the best way to display large trees. Something modelled on Perceptive Pixel's demo might be more useful. I blogged about this last year, but the video host service has disappeared. You can see the tree display 50 seconds in to the video below:
Out of curiosity I grabbed the code from the web site (a 1.5Gb file) and had a quick look. The bulk of the files are media, such as images, movies, and 3D Maya models. There's some nice stuff here. The actual tree itself is there in New Hampshire eXtended format. Here it is displayed in TreeView X:
Well, not Darwin himself, exactly. The Evolution Directory (better known as "EvolDir") is a mailing list run by Brian Golding at McMaster University, Ontario. It's widely used by evolutionary biologists to post announcements about jobs, courses, conferences, software, and other topics of interest to the community.
In this age of spam- and administrivia-clogged inboxes I find it hard to keep track of emails (and routinely ignore most), so it occurred to me that I'd pay more attention to EvolDir if the announcements were made over twitter. Hence, I wrote a script to monitor my email account for EvolDir emails, and post any announcements to twitter. You can follow EvolDir by going to http://twitter.com/evoldir.
One complication was working out a URL for individual EvolDir emails. To make my life simpler I created a simple web archive of each individual email. As a by-product of this there is now an RSS feed for EvolDir, available from http://bioguid.info/services/evoldir/, if RSS is your preferred means of consuming news. I've added the EvolDir RSS feed to the Systematic Biology web site, so that visitors to that site can see the latest announcements from the evolutionary biology community.
One problem I've encountered in building a bibliographic database is the different ways author names are written. For example, for papers I've authored my name may be written as "Roderic D. M. Page" or "R. D. M. Page". Googling about this problem I came across Dror Feitelson's paper On identifying name equivalences in digital libraries. Feitelson addresses the issue of matching first names:
The services provided by digital libraries can be much improved by correctly identifying variants of the same name. For example, this will allow for better retrieval of all the works by a certain author. We focus on variants caused by abbreviations of first names, and show that significant achievements are possible by simple lexical analysis and comparison of names. This is done in two steps: first a pairwise matching of names is performed, and then these are used to find cliques of equivalent names. However, these steps can each be performed in a variety of ways. We therefore conduct an experimental analysis using two real datasets to find which approaches actually work well in practice. Interestingly, this depends on the size of the repository, as larger repositories may have many more similar names.
Feitelson's solution is to construct a graph of similarity between first names, then find weighted cliques grouping equivalent names. For example, given the first names "Ace D. E.", "A. D.", "Abe F. G.", "Abe Bob C.", "A. B. C.", and "Abe B", we create the graph below where the edges are weighted by similarity between the names: In this example, the names "Abe Bob C", "A B C", and "Abe B" are equivalent, as are "Ace D E" and "A D", leaving "Abe F G" by itself.
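To make the grouping above concrete, here is a minimal Python sketch. It is a deliberate simplification and not Feitelson's weighted clique algorithm: it assumes two first names are compatible when, position by position, one token is a prefix of the other (so an initial matches any name starting with that letter), and it groups names by connected components of that unweighted graph. The function names are made up for illustration.

```python
from itertools import combinations


def tokens(name):
    """Split a first-name string into lowercase tokens, dropping full stops."""
    return name.replace(".", " ").lower().split()


def compatible(a, b):
    """True if, for each overlapping position, one token is a prefix of the other."""
    return all(x.startswith(y) or y.startswith(x)
               for x, y in zip(tokens(a), tokens(b)))


def group_names(names):
    """Group names into connected components of the compatibility graph.

    Assumption: connected components stand in for the weighted cliques
    of the real method, which is more careful about ambiguous matches.
    """
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the representative of n's group.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if compatible(a, b):
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


names = ["Ace D. E.", "A. D.", "Abe F. G.", "Abe Bob C.", "A. B. C.", "Abe B"]
groups = group_names(names)
```

For the six names in the example this reproduces the three groups described above, but on a large repository an ambiguous initial could chain unrelated names together, which is presumably where the weighting and clique finding earn their keep.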
I've implemented Feitelson's weighted clique algorithm in a PHP script that calls a C++ program that does the clique analysis. Results can be returned in HTML or JSON. You can try the service at http://bioguid.info/services/. You can also call the service directly by an HTTP POST request to the URL http://bioguid.info/services/equivalent.php with these parameters:
Parameter | Value | Description
names | string | List of first names, separated by end of line (\n) character
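As a rough illustration of calling the service, the Python sketch below builds the POST request with the standard library. Only the URL and the `names` parameter come from the description above. The helper name is hypothetical, and the request is built but not sent, since nothing here says how the response is structured or how to ask for JSON rather than HTML.

```python
from urllib.parse import urlencode
from urllib.request import Request

SERVICE_URL = "http://bioguid.info/services/equivalent.php"


def build_request(first_names):
    """Build (but do not send) a POST request carrying newline-separated names."""
    body = urlencode({"names": "\n".join(first_names)}).encode("utf-8")
    return Request(SERVICE_URL, data=body, method="POST")


request = build_request(["A. D.", "Ace D. E."])
```

Passing `request` to `urllib.request.urlopen` would send it, assuming the service is still running.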
Yes, I know this is ultimately a case of the "genius of and", but the more I play with the Semantic Mediawiki extension the more I think this is going to be the most productive way forward. I've had numerous conversations with Vince Smith about this. Vince and colleagues at the NHM have been doing a lot of work on "Scratchpads" -- Drupal-based web sites that tend to be taxon-focussed. My worry is that in the long term this is going to create lots of silos that some poor fool will have to aggregate together to do anything synthetic with. This makes inference difficult, and also raises issues of duplication (for example, in bibliographies).
I've avoided wikis for a while because of the reliance on plain text (i.e., little structure) (see this old post of mine on Semant), but Semantic Mediawiki provides a fairly simple way to structure information, and it also provides some basic inference. This makes it possible to create wiki pages that are largely populated by database queries, rather than requiring manual editing. For example, I have queries now that will automatically populate a page about a person with that person's publications, and any taxa named after that person. The actual wiki page itself has hardly any text (basically the name of the person). That is, nobody has to manually edit the wiki page to update lists of published papers. Similarly, maps can be generated in situ using queries that aggregate localities mentioned on a wiki page with localities for GenBank sequences and specimens. Very quickly relationships start to emerge without any manual intervention. The combination of templates and Semantic Mediawiki queries seems a pretty powerful way to aggregate information. There are, of course, limitations. The queries are fairly basic, and there's not the power of something like SPARQL, but it's a start. Coupled with the ease of editing to fix the errors in the contributing databases, and the ease of redirecting to handle multiple identifiers, I think a wiki-based approach has a lot of promise.
So, I've been teasing Vince that Drupal (or another CMS) is probably the wrong approach, and that semantic wikis are much more powerful (something Gregor Hagedorn has also been arguing). Vince would probably counter that the goal of scratchpads is to move taxonomists into the digital age by providing them with a customisable platform for them to store and display their data, hence his mission is to capture data. My focus is more to do with aggregating and synthesising the large amount of data we already have (and are struggling to do anything exciting with). Hence, the "genius of and". However, I still worry that when we have a world with loads of scratch pads with overlapping data, some poor fool will still have to merge them together to make sense of it all.
Success is the ability to go from failure to failure without losing your enthusiasm -- Winston Churchill
I learnt today that my Elsevier Challenge entry didn't make the final cut. This wasn't unexpected. In the interests of "open science" (blame Paulo Nuin) here is the feedback I received from the judges:
Strengths Beautiful presentation, lovely website. Page clearly made his case for open access to metadata/full articles in order to allow communities to build the tools they want. The judges would have enjoyed seeing more elements from the original abstract (tree of life). Great contribution so far to the discussion; Page made his point very well.
Weaknesses Given that no specific tool was proposed, this submission is somewhat out of scope for the competition. Nonetheless, in support of his point, Page could have elaborated on the kinds of open formats and standards for text and data and figures that would support integrated community-wide tool-building. Alternatively, if the framework and the displayed functionalities were to be the submission, there could have been more discussion of how others can integrate their plug-ins and make them cross-referential to the plug-ins of others. The proposal for Linked Data should utilize Semantic Web standards
Elements to Consider for Development How many, and which types of, information substrates? How much work for a new developer to create a new one, and to make this work? How to incentivize authors to produce the required metadata? Or to make the data formats uniform?
I think this is a pretty fair evaluation of my entry. I was making a case for what could be done, rather than providing a specific bit of kit that could make this happen right now. I think I was also a little guilty of not following the "underpromise but overdeliver" mantra. My original proposal included harvesting phylogenies from images, and that proved too difficult to do in the time available. I don't think having trees would have ultimately changed the result (i.e., not making the cut), but it would have been cool to have them.
Anyway, time to stomp around the house a bit, and be generally grumpy towards innocent children and pets. Congratulations to the bastards fellow contestants who made it to the next round.
One advantage of flying to the US is the chance to do some reading. At Newark (EWR) I picked up Guy Kawasaki's "Reality Check", which is a fun read. You can get a flavour of the book from this presentation Guy gave in 2006.
While at MIT for the Elsevier Challenge I was browsing in the MIT book shop and stumbled across "Google and the Myth of Universal Knowledge" by Frenchman Jean-Noël Jeanneney. It's, um, very French. I have some sympathy with his argument, but ultimately it comes across as European whining about American success. And the proposed solution involves that classic European solution -- committees! In many ways it's really a librarian complaining about Google (again), which librarians just need to get over:
OK, I'm not really doing the arguments justice, but I'm getting a little tired of European efforts that are essentially motivated by "well the Americans are doing this, so we need to do something as well." Lastly, I also bought Linda Hill's "Georeferencing: The Geographic Associations of Information", which is a little out of date (what, no Google Maps or Google Earth?), but is nevertheless an interesting read, and has lots of references to georeferencing in biodiversity informatics. Given that my efforts for the challenge in this area were so crude, it's something I need to think about a bit more deeply.
Quick post about the Elsevier Challenge, which took place yesterday in the wonderful Stata Center at MIT. It was a great experience. Cool venue, interesting talks, probing questions (having a panel of judges ensured that everybody got feedback/queries). Some talks (like mine) were more aspirational (demos of what could be done), others, such as Sean O'Donoghue's talk on Reflect, and Stephen Wan's on CSBIS (see "In-Browser Summarisation: Generating Elaborative Summaries Biased Towards the Reading Context") were systems that Elsevier could plug in to their existing Science Direct product (and hence are my picks to go forward to the last round).
I was typically blunt in my talk, especially about how useless Science Direct's "2collab" and "Related articles" features were. Rafael Sidi is not unsympathetic to this, and I think despite their status as the Microsoft of publishing (for the XBox crowd, that's a Bad Thing™), the Elsevier people at the meeting were genuinely interested in changing things, and exploring how best to disseminate knowledge. There's hope for them yet! Oh, and special thanks to Anita de Waard and Noelle Gracy for organising the meeting, and the smooth running of the Challenge.
I'm in the US on UK time, so it is probably a bad idea to write this, but the paper by Malte Ebach et al. ("O Cladistics, Where Art Thou?", doi:10.1111/j.1096-0031.2008.00225.x) in the latest Cladistics just annoys me too much. Rather than the call to arms that the authors intend, I think they've provided one more example of the death throes of cladistics (in the narrow parsimony is all, statistical methods are evil, molecular systematics is phenetics, barcoding is killing taxonomy sense).
Associations, such as the Willi Hennig society, and journals, such as Cladistics, were erected in order to tackle the growing problem of pheneticists, purveyors of overall similarity, clustering and divergence rates. Rather than challenge molecular systematists and their numerical taxonomic methods, we take part. Where is our integrity?
Gosh, maybe people realised that molecular data are useful, that molecular data benefit from statistical analysis, and that divergence rates (and times) were of great biological interest? Fancy that!
What happened to the Cladistic Revolution? Today, students appear to have no knowledge of that Revolution. They graduate as students did so before the Revolution, with a sound knowledge of phenetics, ancestor worship and a healthy dose of molecular genetics. What happened to taxonomy and cladistics?
I suspect the real drivers in the "Revolution" were: the development of methods that could be implemented in computer software (I include parsimony in this); computer hardware that was becoming cheaper and more powerful; and, the growth of molecular data (i.e., data that was easily digitised). I don't mean to imply that everything was technologically driven, but I suspect it was a combination of a desire to infer evolutionary trees coupled with plausible means of doing so that drove the "revolution", rather than any great conceptual framework.
Such matters as the Phylocode, DNA taxonomy and barcoding, for example, have risen to prominence despite criticism of their many flaws and illogical conclusions. The attempts of these applied technologies to derail almost 250 years of scholarship are barely even questioned by our own peers with only a few taking a stand (e.g., Will and Rubinoff, 2004; Wheeler, 2005).
Barcoding is happening, get over it. There are technical issues with its ability to identify "species", but to object to it on ideological grounds (as papers published in Cladistics tend to do) is ultimately futile. If the authors dealt with bacteria they wouldn't bat an eyelid. Besides, I suspect that the ability to identify organisms, or discover clusters of similar sequences will be among the least interesting applications of barcoding. There will be a wealth of standardised, geotagged data from across life around the planet. People not blinkered by ideology will do interesting things with these data.
Barcoding is understood as a ‘‘solution’’ (to what, one might ask?), systematics journals are infested with phenetics and population genetics (cladistics has vanished), both, seemingly, directing the course and future of taxonomy. Where are the scholars?
Personally I use the term "phenetics" as a litmus test. If anybody says that a method is "phenetic" then I pretty much switch off. Almost always, if somebody uses this term they simply don't understand what they are talking about. If you describe a method as "phenetic" then that tells me that you either don't understand the method, or you're too lazy to try and understand it.
In some ways all this saddens me. I was an undergraduate student around the time of the heyday of the New York school, thought Systematics and Biogeography: Cladistics and Vicariance was a great (if flawed) book (and I still do), and did my first post doc with Gary Nelson at the AMNH. It was a great time to be a student. Phylogenetic trees were appearing in all sorts of places, and systematists were tackling big topics such as biogeography, diversification, coevolution, and development. There was a sense of ambition, and excitement. Yet now it seems that Cladistics has become a venue for reactionary rants by people unable to break out of the comforting (but ultimately crippling) coherence of the hard-core cladist's world view.
The case of the red lionfish exemplifies how EOL can provide information for science-based decision making. Red lionfish are native to coral reef ecosystems in the Indo-Pacific. Yet, probably due to human release of the fish from aquariums, a large population has found itself in the waters near the Bahamas.
Nope, I suggest it demonstrates just how limited EOL is. If I view the page for the red lionfish I get an out of date map from GBIF that shows a very limited distribution, and doesn't show the introductions in Florida and the Bahamas (I have to wade through text to find reference to the Florida introduction, and the page doesn't mention the Bahamas!). The blog entry states that
In this senerio[sic], EOL and its data partners provide up to date information about the lionfish, or pterois[sic] volitans, in a species page.
In other words, EOL in its present state is serving limited, out of date information. The gap between hype and delivery shows no sign of narrowing. How can this help "science-based decision making"? Surely there will come a point when people will tire of breathless statements about how EOL will be useful, and they will start to ask "where's the beef?"
Quick note to say how much I like the programmers' Q & A site Stack Overflow. I've only asked two questions, but the responses have been rapid and useful. I found out about Stack Overflow by listening to the Stack Overflow podcast episodes on IT Conversations (which carry a lot of other podcasts as well). For a wannabe geek, these podcasts are a great source of ideas.
The idea is to display a table in a fixed space. As you mouse over a cell, the contents of the cell, and the relevant row and column labels become visible. This enables you to get an overview of the full table, but still see individual items:
It's easier to show than explain. For example, take a look at The amphibian tree of life, or watch this short screencast:
There are some things to fix. Firstly, I group all sequences by NCBI taxon and gene "features". If there's more than one sequence for the same gene and taxon, I just show one of them (an obvious solution is to add a popup menu if there's more than one sequence). Secondly, the gene "names" are extracted from GenBank feature tables, and will include synonyms and duplicates (for example, a sequence may have a gene feature "RAG-1" and a CDS feature "recombination activating protein 1"). I've stored all of these as not every sequence is consistently labelled, so excluding one class of feature may lose all labels from a sequence. At some point it would be useful to cluster gene names (a task for another day).
The girl is Carmen Electra, which is understandable given the Yahoo image search was for "Electra" (a genus of bryozoan). However, what are the wild men (and women) doing at the top? Turns out this is the result of searching for the genus Homo. But why, you ask, does a paper on bryozoans have human sequences? Well, looks like the table in the paper has incorrect GenBank accession numbers. The sequences AJ711044-50 should, I'm guessing, be AJ971044-50.
Ironically, although it was Carmen Electra's photo that initially made me wonder what was going on, it's really the hairy folks above her image that signal something is wrong. I've come across at least one other example of a paper citing incorrect sequences, so it might be time to automate this checking. Or, what is probably going to be more fun, looking at treemaps for obviously wrong images and trying to figure out why.
One of the things I've struggled with most in putting together a web site for the challenge is how to summarise the taxonomic content of a study. Initially I was playing with showing a subtree of the NCBI taxonomy, highlighting the taxa in the study. But this assumes the user is familiar with the scientific names of most of life. I really wanted something that tells you "at a glance" what the study is about.
I've settled (for now, at least) on using a treemap of images of the taxa in the study. I've played with treemaps before, and have never been totally convinced of their utility. However, in this context I think they work well. For each paper I extract the taxonomic names (via the GenBank sequences linked to the paper), group them into genera, and then construct a treemap where the size of each cell is proportional to the number of species in each genus. Then I harvest images from Flickr and/or Yahoo's image search APIs and display a thumbnail with a link to the image source.
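The sizing step can be sketched in a few lines of Python. This is only an illustration of "cell size proportional to the number of species in each genus". It assumes the names are simple binomials with the genus as the first word, and the function name is made up. The layout itself and the image harvesting are not covered.

```python
from collections import Counter


def genus_weights(species_names):
    """Return {genus: fraction of the treemap area} from a list of species names.

    Assumes each name starts with "Genus species"; anything after the first
    two words is ignored, and names with fewer than two words are skipped.
    """
    distinct = {
        " ".join(name.split()[:2])
        for name in species_names
        if len(name.split()) >= 2
    }
    counts = Counter(binomial.split()[0] for binomial in distinct)
    total = sum(counts.values())
    return {genus: n / total for genus, n in counts.items()}


weights = genus_weights(
    ["Rana temporaria", "Rana pipiens", "Bufo bufo", "Rana pipiens"]
)
```

Here Rana gets two thirds of the area and Bufo one third, because the duplicate name is only counted once.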
I'm hoping that these treemaps will give the user an almost instant sense of what the study is about, even if it's only "it's about plants". The treemap above is for Frost et al.'s The amphibian tree of life (doi:10.1206/0003-0090(2006)297[0001:TATOL]2.0.CO;2), the one to the right is for Johnson and Weese's "Geographic distribution, morphological and molecular characterization, and relationships of Lathrocasis tenerrima (Polemoniaceae)".
Note that the more taxa a study includes the smaller and more numerous the cells (see below). This may obscure some images, but gives the user the sense that the study includes a lot of taxa. The image search isn't perfect, but I think it works well enough for my purposes.
Elsevier have released this video about the challenge, featuring a few of the contestants. I couldn't get my act together in time to send anything useful, and having seen the 16 gigabytes song (full version here), I'm glad I didn't -- there's just no way I could compete with Michael Greenacre and Trevor Hastie.
Basically, OpenRef is a human-readable identifier for an article, based on concatenating the journal name, year of publication, volume number, and starting page, for example:
A key cosmetic (and philosophical) difference between OpenURL and OpenRef/ResolveRef URLs is that OpenURL uses HTTP GET fields, eg ?title=bla&issn=12345, while OpenRef/ResolveRef uses the URL path itself eg, somejournalname/2008/4/1996. It’s a bit like one scheme was designed in the age of CGI scripts, while the other was designed for web applications capable of more RESTful behaviour. In my mind OpenURL is more versatile but much uglier, while OpenRef is cleaner and simpler but can only reference journal articles.
Of course, it is straightforward to add openref-style URLs to an OpenURL resolver by using URL rewriting, for example:
I've done this for my resolver. One limitation of OpenRef is that there are many different ways to write a journal's name, so you can't determine whether two OpenRefs refer to the same journal by simply string matching (as you can with a DOI, for example -- if the DOIs are different the article is different). For example I might write BMC Bioinformatics and you might write BMC Bioinf.. One way around this is to have unique identifiers for journals, which of course is the approach Robert Cameron advocated with Universal Serial Item Names and JACCs. The obvious candidate for journal identifier is the ISSN. I guess the problem here is that it's easier to use the journal name rather than require the user to know the ISSN. OpenRefs are certainly easier to write. Hence, I think they are great as a simple way for people to construct a resolvable URL for an article, but not so great as an identifier.
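The mapping from an OpenRef-style path to OpenURL-style fields is simple enough to sketch in Python. This is an illustration, not the rewrite rule used by the resolver: the resolver address is a placeholder, and the field names (`title`, `date`, `volume`, `spage`) are assumed OpenURL-style keys, so check them against whatever your resolver expects.

```python
from urllib.parse import urlencode


def openref_to_openurl(path, resolver="http://example.org/openurl"):
    """Turn "journal/year/volume/startpage" into an OpenURL-style GET request.

    The resolver URL and the query field names are assumptions.
    """
    journal, year, volume, start_page = path.strip("/").split("/")
    query = urlencode(
        {"title": journal, "date": year, "volume": volume, "spage": start_page}
    )
    return f"{resolver}?{query}"


url = openref_to_openurl("somejournalname/2008/4/1996")
```

Note that this does nothing about the journal name problem: two spellings of the same journal still produce two different URLs.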
Elsevier Labs is inviting creative individuals who have wanted the opportunity to view and work with journal article content on the web to enter the Elsevier Article 2.0 Contest. Each contestant will be provided online access to approximately 7,500 full-text XML articles from Elsevier journals, including the associated images, and the Elsevier Article 2.0 API to develop a unique yet useful web-based journal article rendering application. What if you were the publisher? Show us your preference!
Elsevier are clearly looking for ideas (they also have their Grand Challenge), and there's been some interesting commentary on the Article 2.0 contest. The site provides some sample applications (written in XQuery), which you can play with by going to the list of journals that are included in the challenge and clicking down through volume and issue until you get to individual articles.
CBS News Sunday Morning segment on the EOL. All fun stuff (Paddy skewering the interviewer who fails to recognise an echidna), but still long on promises and short on actual product.
One problem with my cunning plan to use Mediawiki REDIRECTs to handle DOIs is that some DOIs, such as those that BioOne serves based on SICIs, contain square brackets, [ ], which conflict with wiki syntax. For example, doi:10.1206/0003-0090(2006)297[0001:TATOL]2.0.CO;2. I want to enable users to enter a raw DOI, so I've been playing with a simple URL rewrite in Apache httpd.conf, namely:
This rewrites the [ and ] in the original DOI, then forces a new HTTP request (hence the [NC,R] at the end of the line). This keeps Mediawiki happy, at the cost of the REDIRECT page having a DOI that looks slightly different from the original. However, it means the user can enter the original DOI in the URL, and not have to manually edit it.
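The same idea can be expressed outside Apache. The Python sketch below is not the rewrite rule itself, and the replacement character is an assumption chosen purely for illustration. It swaps each square bracket for a hyphen so that the result is safe to use as a wiki page title.

```python
def wiki_safe_doi(doi, replacement="-"):
    """Replace the square brackets that clash with wiki link syntax.

    The "-" replacement is a hypothetical choice, and the mapping is lossy:
    the original DOI cannot be recovered from the rewritten form.
    """
    return doi.replace("[", replacement).replace("]", replacement)


safe = wiki_safe_doi("10.1206/0003-0090(2006)297[0001:TATOL]2.0.CO;2")
```

Because the mapping is lossy, the original DOI would need to be stored as an attribute of the page if it has to be recovered later.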
Bibliographic coupling is a term coined by Kessler (doi:10.1002/asi.5090140103) in 1963 as a measure of similarity between documents. If two documents, A and B, cite a third, C, then A and B are coupled.
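In code, coupling comes down to intersecting reference lists. The Python sketch below counts the references two documents share and treats that count as the strength of the coupling. The identifiers are placeholders, not real citation data.

```python
def coupling_strength(refs_a, refs_b):
    """Number of references that two documents have in common."""
    return len(set(refs_a) & set(refs_b))


# Placeholder reference lists for two documents, A and B.
refs_a = ["C", "D", "E"]
refs_b = ["C", "E", "F"]
strength = coupling_strength(refs_a, refs_b)  # A and B share C and E
```

Two documents with a strength of zero are not coupled at all.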
I'm interested in extending this to data, such as DNA sequences and specimens. In part this is because within the challenge dataset I'm finding cases where authors cite data, but not the paper publishing the data. For example, a paper may list all the DNA sequences it uses (thus citing the original data), but not the paper providing the data.
To make this concrete, the paper "Towards a phylogenetic framework for the evolution of shakes, rattles, and rolls in Myiarchus tyrant-flycatchers (Aves: Passeriformes: Tyrannidae)" (doi:10.1016/S1055-7903(03)00259-8) lists the sequences used, but does not cite the source of three of these, which is the Science paper "Nonequilibrium Diversity Dynamics of the Lesser Antillean Avifauna" (doi:10.1126/science.1065005). As a result, if I was reading "Nonequilibrium Diversity Dynamics of the Lesser Antillean Avifauna" and wanted to learn who had cited it I would miss the fact that the paper "Towards a phylogenetic framework for the evolution of shakes, rattles, and rolls..." had used the data (and hence, in effect, "cited" the paper). In some cases, data citation may be more relevant than bibliographic citation because it relates to people using the data, which seems a more significant action than simply reading the paper.
Note that I'm not interested in the issue of credit as such. In the above example, the authors of the Science paper are also coauthors of the "shakes, rattles, and rolls" paper, and hence show commendable restraint in not citing themselves. I'm interested in the fate of the data. Who has used it? What have they done with it? Has anybody challenged the data (for example, suggesting a sequence was misidentified)? These are the things that a true "web of data" could tell us.
Duncan Hull alerted me to his paper "Defrosting the Digital Library: Bibliographic Tools for the Next Generation Web" (PLoS Computational Biology, doi:10.1371/journal.pcbi.1000204). Here's the abstract:
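Finding these "used but not cited" cases is a small set operation once the links are in a database. The Python sketch below assumes three lookup tables (paper to sequences used, sequence to the paper that published it, paper to papers cited). The DOIs are the two from the example, but the accession labels and the function name are placeholders.

```python
def uncited_data_sources(paper, sequences_used, sequence_source, citations):
    """Papers whose data `paper` uses without citing them in its bibliography."""
    sources = {
        sequence_source[acc]
        for acc in sequences_used.get(paper, ())
        if acc in sequence_source
    }
    return sorted(sources - set(citations.get(paper, ())) - {paper})


flycatchers = "doi:10.1016/S1055-7903(03)00259-8"
avifauna = "doi:10.1126/science.1065005"

# Placeholder accession labels; the real paper lists GenBank accessions.
sequences_used = {flycatchers: ["SEQ1", "SEQ2", "SEQ3", "SEQ4"]}
sequence_source = {
    "SEQ1": avifauna,
    "SEQ2": avifauna,
    "SEQ3": avifauna,
    "SEQ4": flycatchers,
}
citations = {flycatchers: []}

missing = uncited_data_sources(
    flycatchers, sequences_used, sequence_source, citations
)
```

Running the same query in the other direction (which papers use sequences published by a given paper) would give the "data citations" for that paper.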
Many scientists now manage the bulk of their bibliographic information electronically, thereby organizing their publications and citation material from digital libraries. However, a library has been described as “thought in cold storage,” and unfortunately many digital libraries can be cold, impersonal, isolated, and inaccessible places. In this Review, we discuss the current chilly state of digital libraries for the computational biologist, including PubMed, IEEE Xplore, the ACM digital library, ISI Web of Knowledge, Scopus, Citeseer, arXiv, DBLP, and Google Scholar. We illustrate the current process of using these libraries with a typical workflow, and highlight problems with managing data and metadata using URIs. We then examine a range of new applications such as Zotero, Mendeley, Mekentosj Papers, MyNCBI, CiteULike, Connotea, and HubMed that exploit the Web to make these digital libraries more personal, sociable, integrated, and accessible places. We conclude with how these applications may begin to help achieve a digital defrost, and discuss some of the issues that will help or hinder this in terms of making libraries on the Web warmer places in the future, becoming resources that are considerably more useful to both humans and machines.
It's an interesting read, and it also <shameless plug>cites my bioGUID project</shameless plug>.
The HTML isn't much to look at; the real goodness is the JSON (obtained by appending "&display=json" to the OpenURL request, or ".json" to the short form, e.g. http://bioguid.info/genbank/DQ502033.json).
The resolver gets the sequence from NCBI, does a little post-processing, then displays the result. Post-processing includes parsing the latitude and longitude coordinates (something of a mess in GenBank, see my earlier metacrap rant), extracting specimen codes, adding bibliographic GUIDs (such as DOIs, Handles, or URLs), finding uBio namebankIDs for hosts, etc. Note that some records have a key called "taxonomic_group". This is to provide clues for resolving museum specimens -- often the DiGIR provider needs to know what kind of taxon you are searching for.
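To give a flavour of the coordinate clean-up, here is a minimal Python sketch that parses one common form of GenBank coordinate string, such as "12.5 N 61.2 W", into signed decimal degrees. This is an illustration rather than the code the resolver runs: the function name parse_lat_lon and the regular expression are assumptions, and real records come in many more shapes than this single pattern handles.

```python
import re

# Assumed pattern: decimal degrees followed by hemisphere letters,
# e.g. "12.5 N 61.2 W". Other formats are deliberately not handled.
LAT_LON = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*([NS])\s*,?\s*(\d+(?:\.\d+)?)\s*([EW])\s*$",
    re.IGNORECASE,
)

def parse_lat_lon(text):
    """Return (latitude, longitude) as signed floats, or None if unparseable."""
    match = LAT_LON.match(text)
    if match is None:
        return None
    lat, ns, lon, ew = match.groups()
    # South and West are negative in signed decimal degrees.
    latitude = float(lat) * (-1 if ns.upper() == "S" else 1)
    longitude = float(lon) * (-1 if ew.upper() == "W" else 1)
    # Reject values that cannot be real coordinates.
    if abs(latitude) > 90 or abs(longitude) > 180:
        return None
    return latitude, longitude
```

Returning None for anything that does not match keeps the messy cases visible, so they can be logged and handled by hand rather than silently guessed at.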
The aim is to have a simple service that returns somewhat cleaned up GenBank records that I (and others) can play with.
Thinking more and more about using Mediawiki (or, more precisely, Semantic Mediawiki) as a platform for storing and querying information, rather than writing my own tools completely from scratch. This means I need ways of modelling some relationships between identifiers and objects.
The first is the relationship between document identifiers such as DOIs and metadata about the document itself. One approach which seems natural is to create a wiki page for the identifier, and have that page consist of a #REDIRECT statement which redirects the user to the wiki page on the actual article.
This seems a reasonable solution because:
The user can find the article using the GUID (in effect we replicate the redirection DOIs make use of)
The GUID itself can be annotated
It is trivial to have multiple GUIDs linking to the same paper (e.g., PubMed identifiers, Handles, etc.).
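The redirect idea can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the wiki is represented as a plain dictionary, the page titles and the "pmid:EXAMPLE" identifier are hypothetical placeholders, and the resolve function is my own illustration, not part of Mediawiki.

```python
import re

# A redirect page consists of a line such as "#REDIRECT [[Target page]]".
REDIRECT = re.compile(r"^#REDIRECT\s*\[\[(.+?)\]\]", re.IGNORECASE)

# Hypothetical wiki contents: page title -> page text.
# Identifier pages hold nothing but a redirect to the article page.
pages = {
    "doi:10.1371/journal.pcbi.1000204": "#REDIRECT [[Defrosting the Digital Library]]",
    "pmid:EXAMPLE": "#REDIRECT [[Defrosting the Digital Library]]",
    "Defrosting the Digital Library": "Metadata about the article goes here.",
}

def resolve(title, pages, max_hops=5):
    """Follow #REDIRECT pages until a page with real content is reached."""
    for _ in range(max_hops):
        match = REDIRECT.match(pages[title])
        if match is None:
            return title
        title = match.group(1)
    raise ValueError("too many redirects")
```

Both identifiers end up at the same article page, which is the point of the third item in the list: adding another GUID for a paper is just adding another redirect page.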
Taxon names present another set of problems, mainly because of homonyms (the same name being given to two or more different taxa). The obvious approach is to do what Wikipedia does (e.g., Morus), namely have a disambiguation page that enables the user to choose which taxon they want. For example:
In this example, there are two taxon names Pinnotheres, so the user would be able to choose between them.
For names which had only one corresponding taxon name we would still have two pages (one for the name string, and one for the taxon name), which would be linked by a REDIRECT:
The advantage of this is that if we subsequently discover a homonym we can easily handle it by changing the REDIRECT page to a disambiguation page. In the meantime, users can simply use the name string because they will be automatically redirected to the taxon name page (which will have the actual information about the name, for example, where it was published).
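As a rough Python sketch of that rule, the page for a name string could be generated from the list of taxon name pages that share it: one taxon name gives a redirect, two or more give a disambiguation page. The function name_string_page and the page titles in the usage example are hypothetical, and the wiki markup is simply what I would expect such pages to look like.

```python
def name_string_page(name_string, taxon_name_pages):
    """Return wiki text for a name string page.

    taxon_name_pages is the list of titles of taxon name pages
    that use this name string.
    """
    if not taxon_name_pages:
        raise ValueError("no taxon name pages for %r" % name_string)
    if len(taxon_name_pages) == 1:
        # Only one taxon name: send the user straight there.
        return "#REDIRECT [[%s]]" % taxon_name_pages[0]
    # Homonyms: list the alternatives so the user can choose.
    lines = ["'''%s''' may refer to:" % name_string]
    lines += ["* [[%s]]" % title for title in taxon_name_pages]
    return "\n".join(lines)
```

For example, name_string_page("Pinnotheres", ["Pinnotheres (name A)", "Pinnotheres (name B)"]) produces a two-item disambiguation page, while a list with a single title produces a redirect. Discovering a homonym later only means regenerating the page with a longer list.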
Of course, we could do all of this in custom software, but the more I look at it, the more attractive Semantic Mediawiki looks: it gives us the power to edit the relationships between objects as well as the metadata, and also to make inferences.
Following on from the previous post, I wrote a simple Mediawiki extension to insert a Google Book into a wiki page. Written in a few minutes, not tested much, etc.
To use this, copy the code below and save it in a file googlebook.php in the extensions directory of your Mediawiki installation.
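<?php
# rdmp
# Google Book extension
# Embed a Google Book into Mediawiki
#
# Usage:
# <googlebook id="OCLC:4208784" />
#
# To install it put this file in the extensions directory
# To activate the extension, include it from your LocalSettings.php
# with: require("extensions/googlebook.php");

$wgExtensionFunctions[] = "wfGoogleBook";

function wfGoogleBook() {
    global $wgParser;
    # registers the <googlebook> extension with the WikiText parser
    $wgParser->setHook( "googlebook", "renderGoogleBook" );
}

# The callback function for converting the input text to HTML output
function renderGoogleBook( $input, $argv ) {
    $width = 425;
    $height = 400;

    if (isset($argv["width"])) { $width = $argv["width"]; }
    if (isset($argv["height"])) { $height = $argv["height"]; }

    # Assumption: the rest of this function is missing from this copy of the post.
    # The lines below are a guess at it, using the GBS_insertEmbeddedViewer() call
    # from Google's previewlib.js embed snippet.
    $id = isset($argv["id"]) ? $argv["id"] : "";
    $output = '<script type="text/javascript" src="http://books.google.com/books/previewlib.js"></script>';
    $output .= '<script type="text/javascript">GBS_insertEmbeddedViewer('
        . json_encode($id) . ',' . intval($width) . ',' . intval($height) . ');</script>';
    return $output;
}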
Now you can add a Google book to a wiki page by adding a <googlebook> tag. For example:
<googlebook id="OCLC:4208784" />
The id gives the book identifier, such as an OCLC number or an ISBN (you need to include the identifier prefix). By default, the book will appear in a box 425 × 400 pixels in size. You can add optional width and height parameters to adjust this.
I've started to come across more taxonomic books in Google Books, such as Catalogue of the specimens of snakes in the collection of the British museum by John Edward Gray. Google Books provides a nice widget for embedding views of books. There is a tool for generating the Javascript code. Note that in Blogger (which I use to create this blog) you need to make sure that the Javascript occurs on a single line with no line breaks for it to work.
The Javascript used (with linebreaks that must be removed before using) is:
I stumbled across this book whilst searching for the original record for the snake Enhydris punctata. Confusingly, the Catalogue of Life lists this snake as Enhydris punctata GRAY 1849, implying that Gray's original name still stands, whereas in fact it should be Enhydris punctata (Gray, 1849) as Gray's original name for the snake was Phytolopsis punctata. It's little things like this that drive me nuts, especially as the Catalogue of Life has no obvious, quick means of fixing this (Wiki, anyone?).
I was also interested in using the OCLC number as a GUID for the book, but there are several to choose from (including two related to the Google Book). Unlike DOIs, a book may have multiple OCLCs (sigh). Still, it's a GUID, and it's resolvable, so it's a start. Hence, one could link GUIDs for the names published in this book to the book itself.
As part of the slow rebuild of bioguid.info, and as part of the Challenge, I've started making an OpenURL resolver for specimens. Partly this is just a wrapper around DiGIR providers, but it's also a response to the lack of GUIDs for specimens. In the same way that I think OpenURL for papers only really makes sense in a world without GUIDs for literature (DOIs pretty much take care of that), given the lack of specimen GUIDs we are left to resolve specimens based on metadata.
For example, the holotype of Pseudacris fouquettei (shown in photo by Suzanne L. Collins, original here) is TNHC 63583. In a digital world, I want the paper describing this taxon, and the specimen(s) assigned to it to be a click away. In this spirit, here is an OpenURL link for the specimen: http://bioguid.info/openurl/?genre=specimen&institutionCode=TNHC&collectionCode=Herps&catalogNumber=63583. Click on this link and you get a page with some very basic information on the specimen. If you want more, append "&display=json" to the URL to get a JSON response.
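For anyone who wants to generate these links in bulk, here is a minimal Python sketch that assembles the OpenURL from the three specimen fields. The parameter names and base URL are the ones in the link above; the function name specimen_openurl and its as_json flag are just my own choices for illustration.

```python
from urllib.parse import urlencode

BASE = "http://bioguid.info/openurl/"

def specimen_openurl(institution_code, collection_code, catalog_number, as_json=False):
    """Build an OpenURL for a specimen; as_json appends display=json."""
    params = [
        ("genre", "specimen"),
        ("institutionCode", institution_code),
        ("collectionCode", collection_code),
        ("catalogNumber", catalog_number),
    ]
    if as_json:
        params.append(("display", "json"))
    # urlencode takes care of escaping any awkward characters in the values.
    return BASE + "?" + urlencode(params)
```

Calling specimen_openurl("TNHC", "Herps", "63583") gives the link for the holotype, and passing as_json=True gives the JSON version.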
So, armed with this, TNHC 63583 becomes resolvable, and joining the pieces becomes a little easier.
The latest issue of Wired has an article on DNA barcoding, entitled "A Simple Plan to ID Every Creature on Earth". The article doesn't say much that will be new to biologists, but it's a nice intro to the topic, and some of the personalities involved.
The rather frail nature of biodiversity services (some of the major players have had service breaks in the last few weeks) has prompted me to revisit Dave Vieglais's BigDig and extend it to other services, such as uBio, EOL, and TreeBASE, as well as DSpace repositories and tools such as Connotea.
The result is at http://bioguid.info/status/. The idea is to poll each service once an hour to see if it is online. Eventually I hope to draw some graphs for each service, to get some idea of how reliable it is.
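The polling itself needs very little code. Here is a minimal Python sketch, assuming a service counts as "online" if its URL answers with an HTTP status below 400; that definition, the function names, and the idea of passing the checker in as a parameter are my assumptions for illustration, not a description of what the status page actually runs.

```python
import time
import urllib.request

def is_online(url, timeout=10):
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        # Covers connection failures, timeouts, HTTP errors, and malformed URLs.
        return False

def poll_once(services, check=is_online, now=time.time):
    """Check every service once.

    services maps a service name to a URL. Returns a list of
    (timestamp, name, online) rows, ready to be appended to a log.
    """
    timestamp = now()
    return [(timestamp, name, check(url)) for name, url in services.items()]
```

Running poll_once from an hourly cron job and appending the rows to a file or database table would give the raw material for the reliability graphs. Passing the checker in as an argument also makes it easy to swap in a service-specific test, since a web page that loads is not the same thing as a web service that works.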
Much of my own work depends on using web sites and services, and I'm constantly frustrated when they go offline (sometimes for months at a time).
My aim is to be constructive. I'm well aware that reliability is not easy, and some tools that I've developed myself have disappeared. But I think as a community we need to do a lot better if biodiversity informatics is to deliver on its promise.
The list of services is biased by what I use. I'm also aware that some of the DiGIR provider information is out of date (I basically lifted the list from the BigDig, I'll try and edit this as time allows).
Comments (and requests for adding services) are welcome. There is a comment box at the bottom of the web page, which uses Disqus, a very cool comment system that enables you to keep track of your comments across multiple sites. It also supports OpenID.
D. Ross Robertson has published a paper entitled "Global biogeographical data bases on marine fishes: caveat emptor" (doi:10.1111/j.1472-4642.2008.00519.x - DOI is broken, you can get the article here). The paper concludes:
Any biogeographical analysis of fish distributions that uses GIS data on marine fishes provided by FishBase and OBIS 'as is' will be seriously compromised by the high incidence of species with large-scale geographical errors. A major revision of GIS data for (at least) marine fishes provided by FishBase, OBIS, GBIF and EoL is essential. While the primary sources naturally bear responsibility for data quality, global online providers of aggregated data are also responsible for the content they serve, and cannot side-step the issue by simply including generalized disclaimers about data quality. Those providers need to actively coordinate, organize and effect a revision of GIS data they serve, as revisions by individual users will inevitably lead to confused science (which version did you use?) and a tremendous expenditure of redundant effort. To begin with, it should be relatively easy for providers to segregate all data on pelagic larvae and adults of marine organisms that they serve online. Providers should also include the capacity for users to post readily accessible public comments about the accuracy of individual records and the overall quality of individual data bases. This would stimulate improvements in data quality, and generate 'selection pressures' favouring the usage of better quality data bases, and the revision or elimination of poor-quality data bases. The services provided to the global science community by the interlinked group of online providers of biodiversity data are invaluable and should not be allowed to be discredited by a high incidence of known serious errors in GIS data among marine fishes, and, likely, other marine organisms. (emphasis added)
As I've noted elsewhere on this blog, and as demonstrated by Yesson et al.'s paper on legume records in GBIF (doi:10.1371/journal.pone.0001124) (not cited by Robertson), there are major problems with geographical information in public databases. I suspect there will be more papers like this, which I hope will inspire database providers and aggregators to take the issue seriously. (Thanks to David Patterson for spotting this paper).
Among the many ways to display trees, degree of interest (DOI) trees strike me as one potentially useful way to display trees such as the NCBI taxonomy. For background see, e.g. doi:10.1145/1133265.1133358 (or Google "degree of interest trees").
The thing that would make this really useful is if an application was written that, like Google Earth, supported a simple annotation file format. Hence, users could create their own annotation files (e.g., taxa of a certain size, those with eyes, etc.) and upload those files, creating their own annotation layers, in much the same way as we can load sets of geographical annotations into Google Earth. I think it's this feature which makes Google Earth what it is, so my question is whether we can replicate this for classifications/phylogeny?
Next few weeks will be busy with term starting, kids visiting, and other commitments, so time to jot down some ideas. The first is to have a Wiki for taxonomic names. Bit like Wikispecies, but actually useful, by which I mean useful for working biologists. This would mean links to digital literature (DOIs, Handles, etc.), use of identifiers for names and taxa (such as NCBI taxids, LSIDs, etc.), and having it pre-populated with data. Imagine merging the NCBI taxonomy, Catalogue of Life, Index Fungorum, and IPNI, say, and having it automatically updated with sources such as WoRMS and uBio RSS. Why a Wiki? Well, partly to capture all the little textual annotations that are needed to flesh out the taxonomy, and partly to make it easy to correct the numerous mistakes that litter existing databases.
As an initial target, I'd aim for a comprehensively annotated NCBI taxonomy, as this is probably the most important taxonomic database that we have.
Just to provide a sense of how much data I want to analyse for the Challenge, I have the XML, PDF, and images for 1687 articles from Molecular Phylogenetics and Evolution to play with.
Julia Clarke and I were advocating data mining, not entirely successfully. At one point I started ranting about post-phylogenetics (i.e., what to do when we've basically got the tree of life). For a brief moment I thought this might be a cool new term to use, although Googling finds that W. Ford Doolittle has used it in the title of talks given at the Wenner-Gren Foundations International Symposium at Stockholm in 2003, and at Penn State in 2006. However, the 2006 talk title (Postphylogenetics: The Tree of Life in the Light of Lateral Gene Transfer) suggests a different meaning (i.e., there isn't a tree of life to be found). I prefer to think of it in the same sense as "postgenomics" -- now that we have all this information, how can we make the best use of it?
I've been using ISSNs (International Standard Serial Numbers) to uniquely identify journals, both to generate article identifiers, and as a parameter to send to CrossRef's OpenURL resolver. Recently I've come across journals that change their ISSN, which has fairly catastrophic effects on my lookup tools. For example, the Canadian Journal of Botany has the ISSN 0008-4026, or at least this is what JournalSeek tells me. However, the journal web site tells me that it has been renamed as Botany, with ISSN 1916-2804. The thing is, if I want to look up DOIs for articles published in the Canadian Journal of Botany, I have to use the ISSN for Botany if I want to get a result. Hence, I can't rely on looking up the ISSN for the Canadian Journal of Botany. I've come across this in other journals as well.
WorldCat's xISSN web services provide some tools to help, including a graphical display of the history of a journal and its ISSN(s). Here is the history for 1916-2790, redrawn using Graphviz. WorldCat use Webdot, which I've written about earlier. If you view the source of the WorldCat page you can get the link to the original dot file.
The problem with these changes is that it makes ISSNs more fragile. Ideally, the original ISSN would be preserved, and/or CrossRef would have a table mapping old ISSNs onto new ones. The rate things are going, I may have to create such a table myself.
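Such a table could be very simple. Here is a minimal Python sketch, assuming each old ISSN maps to exactly one successor; the single entry is the Canadian Journal of Botany example above, and the names ISSN_SUCCESSOR and current_issn are hypothetical.

```python
# Old ISSN -> the ISSN that replaced it.
ISSN_SUCCESSOR = {
    "0008-4026": "1916-2804",  # Canadian Journal of Botany -> Botany
}

def current_issn(issn, successors=ISSN_SUCCESSOR):
    """Follow renames until reaching an ISSN with no recorded successor."""
    seen = set()
    while issn in successors:
        if issn in seen:
            # Guard against a badly edited table that loops back on itself.
            raise ValueError("cycle in ISSN table at %s" % issn)
        seen.add(issn)
        issn = successors[issn]
    return issn
```

A lookup tool would call current_issn on whatever ISSN it has before querying for DOIs. Journals that split or merge would need something richer than a one-to-one table, which is presumably why the xISSN history is drawn as a graph.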
Starting to get serious about the Grand Challenge. First step is to parse the XML data Elsevier made available. Sadly this is only for Molecular Phylogenetics and Evolution for 2007, I would have liked the whole journal in XML to avoid hassles with parsing PDF. However, XML is not without its own problems. I'm slowly getting my head around Elsevier's XML (which is, it has to be said, documented in depth). Two tools I find invaluable are the oXygen XML editor, and Marc Liyanage's TestXSLT application.
As a first attempt, I'm converting Elsevier XML into JSON (being a much simpler format to handle). I'm just after what I regard as the core data, namely the bibliography, and the tables (rich with GenBank accession numbers, specimen codes, and geocoordinates). There are a few "gotchas", such as missing namespaces to add, and HTML entities that need to be added. Then there's the fact that the XML describes both the document content and its presentation. Tables can get complicated (cells can span more than one row or column), which makes tasks such as identifying cell contents by using the heading of the corresponding column a bit harder. I hope to put an XSLT style sheet online once I'm happy that it can handle most, if not all the tables I've come across. Then the fun of trying to extract the information can begin.
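One way to tame the spanning cells is to expand every table into a plain rectangular grid, copying the text of a spanning cell into each position it covers, after which each value can be matched to its column heading. Here is a minimal Python sketch of that idea. It assumes the table has already been pulled out of the XML as rows of (text, rowspan, colspan) tuples and that the first row holds the headings; both assumptions, and the function names, are mine, and this is not the XSLT approach mentioned above.

```python
def expand_table(rows):
    """Expand rows of (text, rowspan, colspan) cells into a rectangular grid."""
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already filled by a cell spanning down from above.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]

def rows_as_records(grid):
    """Treat the first grid row as headings and return one dict per body row."""
    header, *body = grid
    return [dict(zip(header, row)) for row in body]
```

With a table where one taxon cell spans two rows of accession numbers, every expanded row carries the taxon name, so each accession can be read off together with the taxon it belongs to. Multi-row headers would need extra handling that this sketch does not attempt.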