iPhylo: data cleaning

Roderic D. M. Page

Showing posts with label data cleaning. Show all posts

Thursday, February 03, 2022

Deduplicating bibliographic data

There are several instances where I have a collection of references that I want to deduplicate and merge. For example, in Zootaxa has no impact factor I describe a dataset of the literature cited by articles in the journal Zootaxa. This data is available on Figshare (https://doi.org/10.6084/m9.figshare.c.5054372.v4), as is the equivalent dataset for Phytotaxa (https://doi.org/10.6084/m9.figshare.c.5525901.v1). Given that the same articles may be cited many times, these datasets have lots of duplicates. Similarly, articles in Wikispecies often have extensive lists of references cited, and the same reference may appear on multiple pages (for an initial attempt to extract these references see https://doi.org/10.5281/zenodo.5801661 and https://github.com/rdmpage/wikispecies-parser).

There are several reasons I want to merge these references. If I want to build a citation graph for Zootaxa or Phytotaxa I need to merge references that are the same so that I can accurate count citations. I am also interested in harvesting the metadata to help find those articles in the Biodiversity Heritage Library (BHL), and the literature cited section of scientific articles is a potential goldmine of bibliographic metadata, as is Wikispecies.

After various experiments and false starts I've created a repository https://github.com/rdmpage/bib-dedup to host a series of PHP scripts to deduplicate bibliographics data. I've settled on using CSL-JSON as the format for bibliographic data. Because deduplication relies on comparing pairs of references, the standard format for most of the scripts is a JSON array containing a pair of CSL-JSON objects to compare. Below are the steps the code takes.

Generating pairs to compare

The first step is to take a list of references and generate the pairs that will be compared. I started with this approach as I wanted to explore machine learning and wanted a simple format for training data, such as an array of two CSL-JSON objects and an integer flag representing whether the two references were the same of different.

There are various ways to generate CSL-JSON for a reference. I use a tool I wrote (see Citation parsing tool released) that has a simple API where you parse one or more references and it returns that reference as structured data in CSL-JSON.

Attempting to do all possible pairwise comparisons rapidly gets impractical as the number of references increases, so we need some way to restrict the number of comparisons we make. One approach I've explored is the “sorted neighbourhood method” where we sort the references 9for example by their title) then move a sliding window down the list of references, comparing all references within that window. This greatly reduces the number of pairwise comparisons. So the first step is to sort the references, then run a sliding window over them, output all the pairs in each window (ignoring in pairwise comparisons already made in a previous window). Other methods of "blocking" could also be used, such as only including references in a particular year, or a particular journal.

So, the output of this step is a set of JSON arrays, each with a pair of references in CSL-JSON format. Each array is stored on a single line in the same file in line-delimited JSON (JSONL).

Comparing pairs

The next step is to compare each pair of references and decide whether they are a match or not. Initially I explored a machine learning approach used in the following paper:

Wilson DR. 2011. Beyond probabilistic record linkage: Using neural networks and complex features to improve genealogical record linkage. In: The 2011 International Joint Conference on Neural Networks. 9–14. DOI: 10.1109/IJCNN.2011.6033192

Initial experiments using https://github.com/jtet/Perceptron were promising and I want to play with this further, but I deciding to skip this for now and just use simple string comparison. So for each CSL-JSON object I generate a citation string in the same format using CiteProc, then compute the Levenshtein distance between the two strings. By normalising this distance by the length of the two strings being compared I can use an arbitrary threshold to decide if the references are the same or not.

Clustering

For this step we read the JSONL file produced above and record whether the two references are a match or not. Assuming each reference has a unique identifier (needs only be unique within the file) then we can use those identifier to record the clusters each reference belongs to. I do this using a Disjoint-set data structure. For each reference start with a graph where each node represents a reference, and each node has a pointer to a parent node. Initially the reference is its own parent. A simple implementation is to have an array index by reference identifiers and where the value of each cell in the array is the node's parent.

As we discover pairs we update the parents of the nodes to reflect this, such that once all the comparisons are done we have a one or more sets of clusters corresponding to the references that we think are the same. Another way to think of this is that we are getting the components of a graph where each node is a reference and pair of references that match are connected by an edge.

In the code I'm using I write this graph in Trivial Graph Format (TGF) which can be visualised using a tools such as yEd.

Merging

Now that we have a graph representing the sets of references that we think are the same we need to merge them. This is where things get interesting as the references are similar (by definition) but may differ in some details. The paper below describes a simple Bayesian approach for merging records:

Councill IG, Li H, Zhuang Z, Debnath S, Bolelli L, Lee WC, Sivasubramaniam A, Giles CL. 2006. Learning Metadata from the Evidence in an On-line Citation Matching Scheme. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries. JCDL ’06. New York, NY, USA: ACM, 276–285. DOI: 10.1145/1141753.1141817.

So the next step is to read the graph with the clusters, generate the sets of bibliographic references that correspond to each cluster, then use the method described in Councill et al. to produce a single bibliographic record for that cluster. These records could then be used to, say locate the corresponding article in BHL, or populate Wikidata with missing references.

Obviously there is always the potential for errors, such as trying to merge references that are not the same. As a quick and dirty check I flag as dubious any cluster where the page numbers vary among members of the cluster. More sophisticated checks are possible, especially if I go down the ML route (i.e., I would have evidence for the probability that the same reference can disagree on some aspects of metadata).

Summary

At this stage the code is working well enough for me to play with and explore some example datasets. The focus is on structured bibliographic metadata, but I may simplify things and have a version that handles simple string matching, for example to cluster together different abbreviations of the same journal name.

Wednesday, August 14, 2013

Cluster maps, papaya plots, and the trouble with GBIF taxonomy

Continuing the theme of the failings of the GBIF classification I've been playing further with cluster maps to visualise the problem (see this earlier post for an introduction).

Browsing through bats in GBIF I keep finding the same species appearing more than once, albeit in different genera. As discussed in the gibbon example, GBIF merges several competing classifications for mammals, and these often don't agree on the "accepted name" for a species. In the absence of a decent database of taxonomic synonyms, GBIF ends up duplicating species, and each duplicate is often associated with different occurence data. If you are trying to get the distribution for a species this can be a disaster.

To get a sense of the scale of the problem I put together a simple tool to create cluster maps. The code is on github) and there is a live service at http://iphylo.org/~rpage/cluster-map/. The service takes a simple tab-delimited file that lists sets and their members, computes the overlap between the sets, calls Graphviz to layout a graph in SVG, then draws in the members of each cluster (phew).

The input file looks something like this:


Molossops	aequatorianus
Chaerephon	aloysiisabaudiae
Tadarida	aloysiisabaudiae
Chaerephon	ansorgei
Tadarida	ansorgei
Molossus	ater
Mormopterus	petrophilus
Sauromys	petrophilus

What can we do with this tool? Well, I created a quick list of all the species of bat in the family Molossidae according to GBIF. The sets are the bat genera, the members are the species (you can see the file here). I then ran this through the cluster map, and got something like this (this is only part of the cluster map):

Bats

(now can you see why I call these "papaya plots"?). Note that there are species names (i.e., specific epithets) in common to more than one genus. Some of these may be perfectly OK (it's not unusual for the same epithet to be used in different species, e.g. "major", etc.). But in many cases these bat species turn out to be the same species, just in different genera in different classifications. For example, GBIF has both Cynomops greenhalli and Molossops greenhalli. These are the same thing. Species in the genus Mormopterus may also occur in other genera. In some cases the issue is competing classifications, sometimes it is conflict over whether a species is a species or merely a subspecies, and some generic conflicts are because some genera are relegated to subgeneric status in some classifications. In short, it's an unholy mess.

Does this matter? Well, consider Mormopterus petrophilus and Sauromys petrophilus, which GBIF both regard as valid species (they're the same thing). Here are the distributions for the two different names in GBIF:

Depending on which name you use you'll get a very different picture of the distribution of this bat.

The next step is to figure out how to fix this. Is there a way we can automate fixing the GBIF classification so that it is not riddled with spurious duplicates like these?

Thursday, August 01, 2013

A use case for RDF in taxonomy

Readers of this blog will know that I'm sceptical about the current value of linked data and RDF in biodiversity informatics. But I came across an interesting paper on RDF and biocuration that suggests a good "use case" for RDF in constructing and curating taxonomic databases.

The paper is "Catching inconsistencies with the semantic web: a biocuration case study" (PDF here) by Jerven Bolleman and Sebastien Gehant. The basic idea is that errors in databases (in this case, UniProt) can be flagged by constructing queries in SPARQL that return results if there is a problem (for example if a sequence annotation is contradictory).

In recent posts I've been complaining about errors in the GBIF taxonomy, notably duplicate taxa that are synonyms. One way to tackle this would be to develop a set of SPARQL queries that we could use to flag potential problems. For example, if two names are objective synonyms then only one of them should be a node in the GBIF classification. If both exist then we have a problem. If we know a name is a homonym of an older name, but that name exists in the GBIF classification, then we could flag that as an issue. We could also construct queries that flag possible problems, even if we don't have precise information on synonymy. For example, in this post I noted that several frog species appear twice in the GBIF classification because GBIF has aggregated classifications that put these frogs in different genera. We could catch such cases by constructing a query to check whether the same species name (specific epithet) appeared in different genera within the same family.

The advantage of using RDF and SPARQL in this context is that that the queries are portable. Assuming everyone uses the same vocabulary (e.g., the TDWG LSID vocabularies) then queries can be constructed by one person (e.g., me) and then used by anyone who has their data in a triple store. We could develop a set of "taxonomy tests" that anyone could apply to their database.

This idea needs some more work, but it would be fun to play with some data and see how many kinds of errors or issues we can catch in this way.

Monday, April 22, 2013

BioNames update - reconciliation strategies

Over on Google Plus (yeah, me neither) Donat Agosti is giving me a hard time regarding the quality of some data that I am using. I've responded to Donat directly, but here I just want to quickly outline two different approaches to cleaning and reconciling bibliographic metadata.

The problem addressed by Donat is the issue of multiple strings for the same journal (e.g., the plethora of different abbreviations and permutations people use to refer to the same journal). In trying to make sense of this mess there are a couple of strategies we can use. One is to cluster the strings into sets that we think refer to the same thing, e.g.:

We could then synthesise the preferred journal name from this set. We could make some sort of consensus string, for example. There are also some quite nice Bayesian methods for combining contradictory metadata.

Another approach, which I use, is to map the strings to a third party identifier, in this case an ISSN:

Once I've done this I can use the identifier to refer to the journal, hence ultimately I don't particularly care what string is best for the journal (indeed, I can defer to a third party for this decision).

The point is obsessing with clean, "correct" bibliographic metadata is something of a fool's errand. Obviously, it's nice to have clean metadata if you can get it, but in many cases there is no exact answer to what is the correct metadata. Some journals have multiple names (e.g., in different languages), some run different volume numbering schemes in parallel, and date of publication can be rather problematic (see my Mendeley group on publication dates). If we can map a publication to a globally unique identifier, such as a DOI, then we can sidestep this issue and focus on what I think really matters - linking data together.

Monday, June 25, 2012

More fictional taxa and the myth of the expert taxonomic database

I know I'm starting to sound like a broken record, but the more I look, the more taxonomic databases seem to be full of garbage. Databases such as the Catalogue of life, which states that it is a "quality-assured checklist" have records that are patently wrong. Here's yet another example.

If you search for the genus Raymondia in the Catalogue of Life you get multiple occurrences of the same species names, e.g.:

Both of these are listed as "provisionally accepted names", supplied by WTaxa: Electronic Catalogue of Weevil names (Curculionoidea). Clearly we can't have two species with the same name, so what's happening?

Firstly, Hustache, A., 1930 is:

Hustache A (1930) Curculionidae Gallo-Rhénans. Annales de la Société entomologique de France 99: 81-272. http://gallica.bnf.fr/ark:/12148/bpt6k6112240j/f3

On p. 246 Hustache refers to Raymondionymus fossor Aubé, 1864 (see below).

F168 highres

So, Raymondionymus fossor Hustache, A., 1930 is not a new species but simply the citation of a previously published one (it's a chresonym). Hustache cites the author of the name as Aubé, 1864, and you can see the original description by Aubé in BioStor (Description de six espèces nouvelles de Coléoptères d'Europe dont deux appartenant a deux genres nouveaux et aveugles, http://biostor.org/reference/104589). So, if the taxonomic authority should be Aubé, 1864, what about Raymondionymus fossor Ganglebauer, L., 1906? Again, if we track down the original publication (Revision der Blindrüsslergattungen Alaocyba und Raymondionymus, http://biostor.org/reference/104591) it's simply Ganglebauer citing (on p. 142) Aubé's paper, not describing a new species.

Note that the nomenclature of this weevil species is further complicated because Aubé originally described the species as Raymondia fossor, but Raymondia was already in use for a fly (see Über eine neue Fliegengattung: Raymondia, aus der Familie der Coriaceen, nebst Beschreibung zweier Arten derselben, http://biostor.org/reference/104588). To resolve this homonymy Wollaston proposed the name Raymondionymus:

Wollaston, T. V. (1873). XVIII. On the Genera of the Cossonidae. Transactions of the Royal Entomological Society of London, 21(4), 427–652. doi:10.1111/j.1365-2311.1873.tb00645.x http://biostor.org/reference/51301

So, we have a bit of a mess. Unfortunately this mess percolates up through other databases, for example EOL has three different pages for Raymondionymus fossor.

For me the lesson here is that relying on acquiring data from "trusted" sources, curated by "experts" is simply not a tenable strategy for building lists of taxa. If names are essential bits of biodiversity infrastructure upon which we hang other data, then these lists need to be cleaned, which means exposing them to scrutiny, and providing an easy means for errors to be flagged and corrected. Trust is something that is earned, not asserted, and it's time taxonomic databases stop claiming to be authoritative simply because they rely on expert sources. Expertise is no guarantee that you won't make errors.

For me this is one of the key reasons projects like BHL are so important. As more and more of the original literature becomes available, we lessen our reliance on "expertise". We can start to see for ourselves. In other words, "Nullius in verba" ("take nobody's word for it").

Wednesday, May 30, 2012

The GBIF classification is broken — how do we fix it?

This post arose from an ongoing email conversation with Tony Rees about extracting and annotating taxonomic names. In BioStor I use the GBIF classification to display the taxonomic names found in the OCR text in the form of a tree. The idea is to give the reader a sense of "what the paper is about". I also use the classification to help link to GBIF occurrence records.

The GBIF backbone classification ("nub") is probably the single largest classification of life that has been assembled, and provides GBIF users with a way to navigate through GBIF's collection of specimen and observation records. Given the scale of the undertaking it is inevitable that there will be issues with the classification, and this post provides one example.

On the page for the article "Further additions to the known marine Molluscan fauna of St. Helena" (http://biostor.org/reference/88554, see also http://dx.doi.org/10.1080/00222939208677383) part of the classification looks like this:


└Animalia
 └Annelida
  └Polychaeta
   └Sabellida
    └Serpulidae
     └Hipponyx

Tony points out that "Hipponyx" is a mollusc, yet in the GBIF classification appears in the annelid worms.

Like a fool I started to investigate further. First off, what is "Hipponyx"? Browsing the GBIF classification there are species of Hipponyx and Hipponix under the genus Hipponix, so it looks like we have two alternative spellings of this genus name. Nomenclator Zoologicus has both spellings, Hipponix credited to DeFrance 1819 Journ. de Physique, 88, 217, and Hipponyx credited to Defrance 1819 Bull. Sci. Soc. philom. Paris, 8. Gotta love those cryptic citations. After some digging around in BHL I found Journ. de Physique, 88, 217 (Mémoire sur un nouveau genre de mollusque) and Bull. Sci. Soc. philom. Paris, 8. (Sur un nouveau genre de coquilles (Hipponix)). Both papers are by Jacques Louis Marin DeFrance, and both use the spelling Hipponix (no 'y'). I'm guessing the second paper is actually the original description of the genus, but my French is abysmal (Google Translate to the rescue).

OK, so we have two spellings of what is probably the same thing (and I've no idea why we have two spellings). Both spellings seem in use (see Google NGrams chart below).

So, bit of a mess, but this still doesn't deal with Hipponyx being a worm in GBIF. After a bit of Googling on "Serpulidae" and "Hipponyx" I came across a specimen record from Te Papa labelled "Worm, Temporaria inexpectata (Mestayer, 1929); holotype; holotype of Hipponyx inexpectata Mestayer, 1929". I then came across this paper:

Fleming, C. A. (1971). A preliminary list of New Zealand fossil polychaetes. New Zealand Journal of Geology and Geophysics, 14(4), 742–756. doi:10.1080/00288306.1971.10426332

with the following abstract:

An annotated list of fossil “worm tubes” from New Zealand includes both published and new records from Mesozoic and Cenozoic deposits.

The binomen Zoophycos plicatus (Hutton) is proposed for the trace fossil long known as the Amuri fucoid, of unknown zoological affinity.

The following living species are recorded as New Zealand fossils for the first time: Protula bispiralis (Savigny), Salmacina dysteri (Huxley), Hydroides norvegicus Gunnerus, Pomatoceras cariniferus (Gray), P. aff. terranovae (Benham), Galeolaria hystrix (Moerch), Boccardia ? polybranchia (Haswell); new records of fossil species are Ditrupa cf. plana (Sowerby), Dorsoserpula lumbricalis (Schlotheim), and Neomicrorbis crenatostriatus (Münster). The name Hipponyx inexpectata Mestayer 1929, applied to a serpulid operculum, is used in the combination Temporaria inexpectata for a tubeworm common in deep water off New Zealand that has also been identified, with associated operculum, from the bathyal Waitotaran (Pliocene) sediments of Palliser Bay. Serpula wharjensis Wilkens and S. ougenensis Chapman are placed in Sclerostyla Moerch. Two species of Vermiliopsis and two of Spirorbis are figured but not named specifically.

The author of the paper (Charles Fleming) argues that Hipponyx inexpectata, regarded as a mollusc by its describer (Marjorie K. Mestayer, see Notes on New Zealand Mollusca. No. 4.) is actually a worm, and he moves it to the genus Temporaria.

So it seems that the reason Hipponyx has ended up being a worm in the GBIF classification is due to this synonymy.

Now, this little investigation was "fun", but took a couple of hours. Much of that was spent tracking down the literature and adding it to BioStor, which is a one-time cost. Not every issue with the GBIF classification will take this long to resolve, some cases may take longer. So there's a problem of scalability. Then there's the issue of how this information gets into the GBIF classification so we fix it (and so that people don't think Hipponyx is a worm). As has been said several times before, most eloquently by David Shorthouse, isn't it time we started using software development tools such as version control to help build, annotate, and correct classifications such as the one that underpins GBIF? That way when somebody spots an error it can be flagged, and someone with the time (and curiosity) can fix it.

Wednesday, February 22, 2012

Clustering strings

Revisiting an old idea (Clustering taxonomic names) I've added code to cluster strings into sets of similar strings to the phyloinformatics course site.

This service (available at http://iphylo.org/~rpage/phyloinformatics/services/clusterstrings.php) takes a list of strings, one per line, and returns a list of clusters. For example, given the names


Ferrusac 1821
Bonavita 1965
Ferussa 1821
Fer.
Lamarck 1812
Ferussac 1821

the service finds three clusters, displayed here using Google images:

(Note to self, investigate canviz as an alternative for displaying graphviz graphs.)

If you are curious, these strings are taxonomic authorities associated with the name Helicella, and based on this clustering there are three taxonomic names, one of which has three different variations of the author's name.

Monday, February 06, 2012

Using Google Refine and taxonomic databases (EOL, NCBI, uBio, WORMS) to clean messy data

Google Refine is an elegant tool for data cleaning. One of its most powerful features is the ability to call "Reconciliation Services" to help clean data, for example by matching names to external identifiers. Google Refine comes with the ability to use Freebase reconciliation services, but you can also add external services. Inspired by this I've started to implement services to reconcile taxonomic names.

The services I've implemented so far are:

EOL http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_eol.php
NCBI taxonomy http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_ncbi.php
uBio FindIT http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_ubio.php
WORMS http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_worms.php
GBIF http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_gbif.php
Global Names Index http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_globalnames.php

To use these you need to add the URLs above to Google Refine (see example below). The EOL, NCBI and WORMS do a basic name lookup. The uBio FindIT service extracts a taxonomic name from a string, and can be viewed as a "taxonomic name cleaner".

How to use reconciliation services

Start a Google Refine session. Save the names below to a text file and open it as a new project.


Names
Achatina fulica (giant African snail)
Acromyrmex octospinosus ST040116-01
Alepocephalus bairdii (Baird's smooth-head)
Alaska Sea otter (Enhydra lutris kenyoni)
Toxoplasma gondii
Leucoagaricus gongylophorus
Pinnotheres
Themisto gaudichaudii
Hyperiidae

You should see something like this:
Refine1

Click on the column header Names and choose Reconcile → Start reconciling.

Refine2

A dialog will popup asking you to select a service.

Refine3

If you've already added a service it will be in the list on the left. If not, click the Add Standard Services... button at the bottom left and paste in the URL (in this case http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_ubio.php).

Once the service has loaded click on Start Reconciling. Once it has finished you should see most of the names linked to uBio (click on a name to check this):

Refine4

Sometimes there may be more than one possible match, in which case these will be listed in the cell. Once you have reconciled the data you may want to do something with the reconciliation. For example, if you want to get the ids for the names you've just matched you can create a new column based on the reconciliation. Click on the Names column header and choose Edit column → Add column based on this column.... A dialog box will be displayed:

Refine6

In the box labelled Expression enter cell.recon.match.id and give the column a name (e.g., "NamebankID"). You will now have a column of uBio NamebankIDs for the names:

Refine7

You could also get the names uBio extracted by creating a column based on the values of cell.recon.match.name. To compare this with the original values, click on the Names column header and choose Reconcile → Actions → Clear reconciliation data. Now you can see the original input names, and the string uBio extracted from each name:

Refine8

These are some very simple ideas for using Google Refine with taxonomic name services. Obvious extensions would to use services that provide an "accepted name", or services that support approximate string matching so you could catch spelling mistakes (most of the services I've implemented here have some degree of support for these features).

Development notes
The code for these services is in Github (undocumented as yet, that's on the to do list). I had a few hiccups getting these services to work. There is detailed documentation at http://code.google.com/p/google-refine/wiki/ReconciliationServiceApi, but this seems a little out of step with what actually happens. Based on the documentation I thought Google Refine called a reconciliation service using HTTP GET, but in fact it uses POST. Google Refine always called my reconciliation service using "Multiple Query Mode", which meant supporting this mode wasn't optional. Once these issues were sorted out (turning on the Java console as per David Huynh's tip helped) things work pretty well.

Wednesday, November 10, 2010

Mendeley mangles my references: phantom documents and the problem of duplicate references

One issue I'm running into with Mendeley is that it can create spurious documents, mangling my references in the process. This appears to be due to some over-zealous attempts to de-duplicate documents. Duplicate documents is the number one problem faced by Mendeley, and has been discussed in some detail by Duncan Hull in his post How many unique papers are there in Mendeley?. Duncan focussed on the case where the same article may appear multiple times in Mendeley's database, which will inflate estimates of how many distinct references the database contains. It also has implications for metrics derived from the Mendeley, such as those displayed by ReaderMeter.

In this post I discuss the reverse problem, combining two or more distinct references into one. I've been uploading large collections of references based on harvesting metadata for journal articles. Although the metadata isn't perfect, it's usually pretty good, and in many cases linked to Open Access content in BioStor. References that I upload appear in public groups listed on my profile, such as the group Proceedings of the Entomological Society of Washington.

Reverse engineering Mendeley
In the absence of a good description by Mendeley of how their tools work, we have to try and figure it out ourselves. If you click on a refernece that has been recently added to Mendeley you get a URL that looks like this: http://www.mendeley.com/c/3708087012/g/584201/magalhaes-2008-a-new-species-of-kingsleya-from-the-yanomami-indians-area-in-the-upper-rio-orinoco-venezuela-crustacea-decapoda-brachyura-pseudothelphusidae/ where 584201 is the group id, 3708087012 is the "remoteId" of the document (this is what it's called in the SQLite database that underlies the desktop client), and the rest of the URL is the article title, minus stop words.

After a while (perhaps a day or so) Mendeley gets around to trying to merge the references I've added with those it already knows about, and the URLs lose the group and remoteId and look like this: http://www.mendeley.com/research/review-genus-saemundssonia-timmerman-phthiraptera-philopteridae-alcidae-aves-charadriiformes-including-new-species-new-host/ . Let's call this document the "canonical document" (this document also has a UUID, which is what the Mendeley API uses to retrieve the document). Once the document gets one of these URLs Mendeley will also display how many people are "reading" that document, and whether anyone has tagged it.

But that's not my paper!
The problem is that sometimes (and more often than I'd like) the canonical document bears little relation to the document I uploaded. For example, here is a paper that I uploaded to the group Proceedings of the Entomological Society of Washington:

Review of the genus Saemundssonia Timmermann (Phthiraptera: Philopteridae) from the Alcidae (Aves: Charadriiformes), including a new species and new host records by Roger D Price, Ricardo L Palma, Dale H Clayton, Proceedings of the Entomological Society of Washington, 105(4):915-924 (2003).

You can see the actual paper in BioStor: http://biostor.org/reference/57185. To see the paper in the Mendeley group, browse it using the tag Phthiraptera:

Note the 2, indicating that two people (including myself) have this paper in their library. The URL for this paper is http://www.mendeley.com/research/review-genus-saemundssonia-timmerman-phthiraptera-philopteridae-alcidae-aves-charadriiformes-including-new-species-new-host/, but this is not the paper I added!.

What Mendeley displays for this URL is this:

Not only is this not the paper I added, there is no such paper! There is a paper entitled "A new genus and a new species of Daladerini (Hemiptera: Heteroptera: Coreidae) from Madagascar", but that is by Harry Brailovsky, not Clayton and Price (you can see this paper in BioStor as http://biostor.org/reference/55669). The BioStor link for the phantom paper displayed by Mendeley, http://biostor.org/reference/55761, is for a third paper "A review of ground beetle species (Coleoptera: Carabidae) of Minnesota, United States : New records and range extensions". The table below shows the original details for the paper, the details for the "canonical paper" created by Mendeley, and the details for two papers that have some of the bibliographic details in common with this non-existent paper (highlighted in bold).

Field	Original paper	Mendeley
Title	Review of the genus Saemundssonia Timmermann (Phthiraptera: Philopteridae) from the Alcidae (Aves: Charadriiformes), including a new species and new host records	A new genus and a new species of Daladerini (Hemiptera: Heteroptera: Coreidae) from Madagascar	A new genus and a new species of Daladerini (Hemiptera: Heteroptera: Coreidae) from Madagascar	A review of ground beetle species (Coleoptera: Carabidae) of Minnesota, United States : New records and range extensions
Author(s)	Roger D Price, Ricardo L Palma, Dale H Clayton	DH Clayton, RD Price	Harry Brailovsky
Volume	105	105	104	107
Pages	915-924	915-924	111-118	917-940
BioStor	57185	55761	55669	55761

As you can see it's a bit of a mess. Now, finding and merging duplicates is a hard problem (see doi:10.1145/1141753.1141817 for some background), but I'm struggling to see why these documents were considered to be duplicates.

What I'd like to see
I'm a big fan of Mendeley, so I'd like to see this problem fixed. What I'd really like to see is the following:

Mendeley publish a description of how their de-duplication algorithms work

Mendeley describe the series of steps a document goes through as they process it (if nothing else, so that users can make sense of the multiple URLs a document may get over it's lifetime in Mendeley).

For each canonical reference Mendeley shows the the set of documents that have been merged to create that canonical reference, and display some measure of their confidence that the match is genuine.

Mendeley enables users to provide feedback on a canonical document (e.g., a button by each document in the set that enables the user to say "yes this is a match" or "no, this isn't a match").

Perhaps what would be useful is if Mendeley (or the community) assemble a test collection of documents which contains duplicates, together with a set of the canonical documents this collection actually contains, and use this to evaluate alternative algorithms for finding duplicates. Let's make this a "challenge" with prizes! In many ways I'd be much more impressed by a duplication challenge than the DataTEL challenge, especially as it seems clear that Mendeley readership data is too sparse to generate useful recommendations (see Mendeley Data vs. Netflix Data).

Friday, October 23, 2009

n-gram fulltext indexing in MySQL

Continuing with my exploration of the Biodiversity Heritage Library one obstacle to linking BHL content with nomenclature databases is the lack of a consistent way to refer to the same bibliographic item (e.g., book or journal). For example, the Amphibia Species of the World (ASW) page for Gastrotheca aureomaculata gives the first reference for this name as:

Gastrotheca aureomaculata Cochran and Goin, 1970, Bull. U.S. Natl. Mus., 288: 177. Holotype: FMNH 69701, by original designation. Type locality: "in [Departamento] Huila, Colombia, at San Antonio, a small village 25 kilometers west of San Agustín, at 2,300 meters".

The journal that ASW abbreviates as "Bull. U.S. Natl. Mus." is in the BHL, which gives its title as "Bulletin - United States National Museum.". How do I link these two records? In my bioGUID OpenURL project I've been doing things like using SQL LIKE statements with periods (.) replaced by wildcards ('%') to find journal titles that match abbreviations (as well as building a database of these abbreviations). But this is error prone, and won't work for abbreviations such as "Bull. U.S. Natl. Mus." because the word "National" has been abbreviated to "Natl", which isn't a substring of "National".

After exploring various methods (including longest common subsequences, and sequence alignment algorithms) I came across a MySQL plugin for n-grams. The plugin tokenises strings into bi-grams (tokens with just two characters, see the Wikipedia page on N-grams for more information). This means that even though as words "National" and "Natl" are different, they will have some similarity due to the shared bi-grams "Na" and "at".

So, I grabbed the source for the plugin and the ICU dependency, compiled the plugin and added it to MySQL (I'm running MySQL 5.1.34 on Mac OS X 10.5.8). The plugin can be added while the MySQL server is running using this SQL command:


INSTALL PLUGIN bigram SONAME 'libftbigram.so';

Initial experiments seem promising. For the bhl_title table I created a bi-gram index:


ALTER TABLE `bhl_title` ADD FULLTEXT (`ShortTitle`) WITH PARSER bigram;

If I then take the abbreviation "Bull. U.S. Natl. Mus.", strip out the punctuation, and search for the resulting string ("Bull US Natl Mus")


SELECT TitleID, ShortTitle, MATCH(ShortTitle) AGAINST('Bull U S Natl Museum')
AS score FROM bhl_title
WHERE MATCH(ShortTitle) AGAINST('Bull U S Natl Museum') LIMIT 5;

I get this:

TitleID	ShortTitle	score
7548	Bulletin - United States National Museum.	19.4019603729248
13855	Bulletin du Muséum National d'Histoire Naturelle.	17.6493873596191
14964	Bulletin du Muséum National d'Histoire Naturelle.	17.6493873596191
5943	Bulletin du Muséum national d'histoire naturelle.	17.6493873596191
12908	Bulletin du Muséum National d'Histoire Naturelle.	17.6493873596191

The journal we want is the top hit (if only just). I'll probably have to do some post-processing to check that the top hit makes sense (e.g., is it a supersequence of the search term?) but this looks like a promising way to match abbreviated journal names and book titles to records in BHL (and other databases).