Sunday, November 23, 2014

Automatically extracting possible taxonomic synonyms from the literature

Quick notes on an experimental feature I've added to BioNames. It attempts to identify possible taxonomic synonyms by extracting pairs of names that share the same specific name (species epithet) and appear together on the same page of text. The text could be full text for an open access article, OCR text from BHL, or the title and abstract for an article. For example, the following paper creates a new combination, Hadwenius tursionis, for a parasite of the bottlenose dolphin. This name is a synonym of Synthesium tursionis.

Fernández, M., Balbuena, J. A., & Raga, J. A. (1994, July). Hadwenius tursionis (Marchi, 1873) n. comb. (Digenea, Campulidae) from the bottlenose dolphin Tursiops truncatus (Montagu, 1821) in the western Mediterranean. Syst Parasitol. Springer Science + Business Media. doi:10.1007/bf00009519

The taxonomic position of Synthesium tursionis (Marchi, 1873) (Digenea, Campulidae) is revised, based on material from 147 worms from four bottlenose dolphins Tursiops truncatus stranded off the Comunidad Valenciana (Spanish western Mediterranean). The species is transferred to Hadwenius, as H. tursionis n. comb., and characterised by a high length/width ratio of the body, spinose cirrus and unarmed metraterm. Synthesium, a monotypic genus, becomes a synonym of Hadwenius. The intraspecific variation of some morphological traits is briefly discussed.

If we extract taxonomic names from the title and abstract, we get the pair (Synthesium tursionis, Hadwenius tursionis). If we do this across all the text currently in BioNames, we discover other pairs of names that include Synthesium tursionis. Joining these pairs together creates a graph of co-occurring names that are possible synonyms (see Synthesium tursionis).

[Figure: co-occurrence graph linking Synthesium tursionis, Hadwenius tursionis, Dicrocoelium tursionis, Distomum tursionis, Orthosplanchnus tursionis, and Synthesium (Orthosplanchnus) tursionis]
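
The pairing and graph-building steps described above can be sketched in a few lines of Python. This is only an illustration, not the BioNames code: it assumes the names on each page have already been found by some name-finding tool, it takes the last word of a name as the specific name, and the function names and page identifiers (including the second, made-up page) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def epithet(name):
    """Return the last word of a name, taken here as the specific name."""
    return name.split()[-1]


def candidate_pairs(names_by_page):
    """Map each pair of distinct names sharing an epithet to the pages where both occur."""
    evidence = defaultdict(set)
    for page_id, names in names_by_page.items():
        # sorted() gives each pair a stable (a, b) order across pages
        for a, b in combinations(sorted(set(names)), 2):
            if epithet(a) == epithet(b):
                evidence[(a, b)].add(page_id)
    return evidence


def components(pairs):
    """Group names into connected components of the co-occurrence graph."""
    adjacency = defaultdict(set)
    for a, b in pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, clusters = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, cluster = [start], set()
        while stack:  # depth-first walk from the starting name
            node = stack.pop()
            if node in cluster:
                continue
            cluster.add(node)
            stack.extend(adjacency[node] - cluster)
        seen |= cluster
        clusters.append(cluster)
    return clusters


# Hypothetical input: page identifiers and the second page are made up.
pages = {
    "fernandez-1994-abstract": [
        "Hadwenius tursionis", "Synthesium tursionis", "Tursiops truncatus",
    ],
    "made-up-page-2": ["Synthesium tursionis", "Distomum tursionis"],
}

evidence = candidate_pairs(pages)
for cluster in components(evidence):
    print(sorted(cluster))
```

Keeping the set of pages for each pair is what makes it possible to show the supporting references alongside each suggested synonym. Note that Tursiops truncatus drops out because it shares no specific name with the other names on its page.
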
These graphs are computed automatically, and there is inevitably scope for error. Taxa that are not synonyms may have the same specific name (e.g., parasites and hosts may have the same specific name), and some of the names extracted from the text may be erroneous. Even so, anecdotally it is a useful way to discover links between names. Even better, this approach means that we have the associated evidence for each pair of names. The interface in BioNames lists the references that contain the pairs of names, so you can evaluate the evidence for synonymy. It would be useful to evaluate the automatically detected synonyms by comparing them with existing lists of synonyms (e.g., from GBIF).
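
One way such a comparison might look, as a rough sketch only: treat each synonym pair as an unordered pair of names and measure how many detected pairs appear in a reference list (precision) and how many reference pairs were detected (recall). The function name is hypothetical, the reference list here is a hand-built stand-in rather than real GBIF data, and "Aus bus" / "Xus bus" are placeholder names standing for a false positive such as a host and parasite sharing a specific name.

```python
def evaluate(detected, reference):
    """Compare detected synonym pairs with a reference list of pairs."""
    # frozenset makes (a, b) and (b, a) count as the same pair
    detected = {frozenset(pair) for pair in detected}
    reference = {frozenset(pair) for pair in reference}
    confirmed = detected & reference
    return {
        "precision": len(confirmed) / len(detected) if detected else 0.0,
        "recall": len(confirmed) / len(reference) if reference else 0.0,
        # pairs worth checking by hand against the cited references
        "unconfirmed": detected - reference,
    }


# Hypothetical data: one pair from the example above, one placeholder false positive.
detected = [
    ("Synthesium tursionis", "Hadwenius tursionis"),
    ("Aus bus", "Xus bus"),
]
reference = [("Hadwenius tursionis", "Synthesium tursionis")]

result = evaluate(detected, reference)
print(result["precision"], result["recall"])
```

A pair missing from the reference list is not necessarily wrong, since existing synonym lists are themselves incomplete, so the unconfirmed pairs are candidates for manual checking rather than confirmed errors.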