Showing posts with label BioNames. Show all posts

Friday, August 26, 2016

Displaying original species descriptions in BioNames

The goal of my BioNames project is to link every taxonomic name to its original description (initially focussing on animal names). The rationale is that taxonomy is based on evidence, and yet most of this evidence is buried in non-digitised and/or hard-to-find literature. Surfacing this information not only makes taxonomic evidence accessible (see Surfacing the deep data of taxonomy), it also surfaces a lot of basic biological information. In many cases the original taxonomic description will be an important source of information about what a species looks like, where it lives, and what it does.

To date I've focussed on linking names to publications, such as articles, on the grounds that this is the unit of citation in science. It's also the unit most often digitised and assigned an identifier, such as a DOI. But often taxonomists cite not an article but the individual page on which the description appears. In web-speak, taxonomists cite "fragment identifiers". Page-level identifiers are not often encountered in the digital world, in part because many digital representations don't have "pages". But this doesn't mean that we can't have identifiers for parts of an article. For example, in Fragment Identifiers and DOIs Martin Fenner gives examples of ways to link to specific parts of an online article. His examples work if the article is displayed as HTML. If we are working with XML (say, for a journal published by Pensoft), then we can use XPath to refer to sections of a document. Ultimately it would be nice to have stable identifiers for document fragments linked to taxonomic names, so that we can readily go from name to description (even better if that description was in machine-readable form). You could think of these as locators for "taxonomic treatments", e.g. Miller et al. 2015.
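As a rough illustration of the XPath idea, the sketch below uses Python's standard library to pull one section out of a small XML article. The document structure, the element names (sec, title, p) and the id values are made up for this example and are not taken from any real journal file.

```python
import xml.etree.ElementTree as ET

# A made-up, minimal article; real journal XML is far richer than this.
ARTICLE_XML = """
<article>
  <body>
    <sec id="introduction"><title>Introduction</title><p>Background text.</p></sec>
    <sec id="description"><title>Description</title><p>Body elongate, fins rounded.</p></sec>
  </body>
</article>
"""


def find_section(xml_text, section_id):
    """Return the text of the <sec> whose id attribute matches, or None."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a subset of XPath, including attribute predicates.
    section = root.find(f".//sec[@id='{section_id}']")
    if section is None:
        return None
    return " ".join(t.strip() for t in section.itertext() if t.strip())


description = find_section(ARTICLE_XML, "description")
print(description)
```

The XPath expression plays the role of a fragment identifier: stored alongside a taxonomic name, it would let us jump from the name to the relevant part of the document rather than to the article as a whole.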

As a quick and dirty approach to this I've reworked BioNames to be able to show the page where a species name is first published. This only works if a number of conditions are met:

  • The BioNames database has the page number ("micro reference") for the name.
  • BioNames has the full text for the article, either from BioStor or a PDF.
  • The taxonomic name has been found in that text (e.g., by the Global Names GNRD service).

If these conditions are met, then BioNames will display the page, as in this example for Belobranchus segura Keith, Hadiaty & Lord 2012:

[Screenshot: BioNames page view for Belobranchus segura]

Both the page image and OCR text (if available) are displayed. This is a first step towards (a) making stable identifiers available for these pages, and (b) making the text accessible for machine reading.

For some more examples, try Heterophasia melanoleuca kingi Eames 2002 (bird), Echinoparyphium anatis Fischthal & Kuntz 1976 (trematode), Bathymodiolus brooksi Gustafson, Turner, Lutz & Vrijenhoek 1998 (bivalve), Amolops cremnobatus Inger & Kottelat 1998 (frog), Leptothorax caesari Espadaler 1997 (ant), and Daipotamon minos Ng & Trontelj 1996 (crab).

Tuesday, December 01, 2015

Guest post: 10 years of global biodiversity databases: are we there yet?

This guest post by Tony Rees explores some of the themes from his recent talk 10 years of Global Biodiversity Databases: Are We There Yet?.

A couple of months ago I received an invitation to address the upcoming 2015 meeting of the Malacological Society of Australasia (Australian and New Zealand mollusc persons for the uninitiated) on some topic associated with biodiversity databases, and I decided that a decadal review might be an interesting exercise, both for my potential audience (perhaps) and for my own interest (definitely). Well, the talk is delivered and the slides are available on the web for viewing if interested, and Rod has kindly invited me to present some of its findings here, and possibly stimulate some ongoing discussion since a lot of my interests overlap his own quite closely. I was also somewhat influenced in my choice of title by a previous presentation of Rod's from some 5 years back, "Why aren't we there yet?" which provides a slightly pessimistic counterpoint to my own perhaps more optimistic summary.

I decided to construct the talk around 5 areas: compilations of taxonomic names and associated valid/accepted taxa; links to the literature (original citations, descriptions, more); machine-addressable lists of taxon traits; compilations of georeferenced species data points such as OBIS and GBIF; and synoptic efforts in the environmental niche modelling area (all or many species so as to be able to produce global biodiversity as well as single-species maps). Without recapping the entire content of my talk (which you can find on SlideShare), I thought I would share with readers of this blog some of the more interesting conclusions, many of which are not readily available elsewhere, at least not without effort to chase them down and/or make educated guesses.

In the area of taxonomic names, for animals (sensu lato) ION has gone up from 1.8m to 5.2m names (2.8m to 3.5m indexed documents) from all ranks (synonyms not distinguished) over the cited period 2005-2015, while Catalogue of Life has gone up from 0.5m species names + ?? synonyms to 1.6m species names + 1.3m synonyms over the same period; for fossils, BioNames database is making some progress in linking ION names to external resources on the web but, at less than 100k such links, is still relatively small scale and without more than a single-operator level of resourcing. A couple of other "open access" biological literature indexing activities are still at a modest level (e.g. 250k-350k citations, as against an estimated biological literature of perhaps 20m items) at present, and showing few signs of current active development (unless I have missed them of course).

Comprehensive databases of taxon traits (in machine addressable form) appear to have started with the author’s own "IRMNG" genus- and species- level compendium which was initially tailored to OBIS needs for simply differentiating organisms into extant vs. fossil, marine vs. nonmarine. More comprehensive indexes exist for specific groups and recently, Encyclopedia of Life has established "TraitBank" which is making some headway although some of the "traits" such as geographic distribution (a bounding box from either GBIF or OBIS) and "number of GenBank sequences" stretch the concept of trait a little (just my two cents' worth, of course), and the newly created linkage to Google searches is to be applauded.

With regard to aggregating georeferenced species data (specimens and observations), both OBIS (marine taxa only) and GBIF (all taxa) have made quite a lot of progress over the past ten years, OBIS increasing its data holdings eightfold from 5.6m to 44.9m (from 38 to 1,900+ data providers) and GBIF more than tenfold from 45m to 577m records over the same period, from 300+ to over 15k providers. While these figures look healthy there are still many data gaps in holdings e.g. by location sampled, year/season, ocean depth, distance to land etc. and it is probably a fair question to ask what is the real "end point" for such projects, i.e. somewhere between "a record for every species" and "a record for every individual of every species", perhaps...

Global / synoptic niche modelling projects known to the author basically comprise Lifemapper for terrestrial species and AquaMaps for marine taxa (plus some freshwater). Lifemapper claims "data for over 100,000 species" but it is unclear whether this corresponds to the number of completed range maps available at this time, while AquaMaps has maps for over 22,000 species (fishes, marine mammals and invertebrates, with an emphasis on fishes) each of which has a point data map, a native range map clipped to where the species is believed to occur, an "all suitable habitat map" (the same unclipped) and a "year 2100 map" showing projected range changes under one global warming scenario. Mapping parameters can also be adjusted by the user using an interactive "create your own map" function, and stacking all completed maps together produces plots of computed ocean biodiversity plus the ability to undertake web-based "what [probably] lives here" queries for either all species or for particular groups. Between these two projects (which admittedly use different modelling methodologies but both should produce useful results as a first pass) the state of synoptic taxon modelling actually appears quite good, especially since there are ongoing workshops e.g. the recent AMNH/GBIF workshop Biodiversity Informatics and Modelling Species Distributions at which further progress and stumbling blocks can be discussed.

So, some questions arising:

  • Who might produce the best "single source" compendium of expert-reviewed species lists, for all taxa, extant and fossil, and how might this happen (my guess: a consortium of Catalogue of Life + PaleoBioDB at some future point)
  • Will this contain links to the literature, at least citations but preferably as online digital documents where available? (CoL presently no, PaleoBioDB has human-readable citations only at present)
  • Will EOL increasingly claim the "TraitBank" space, and do a good enough job of it? (also bearing in mind that EOL is still an aggregator, not an original content creator, i.e. somebody still has to create it elsewhere)
  • Will OBIS and/or GBIF ever be "complete", and how will we know when we’ve got there (or, how complete is enough for what users might require)?
  • Same for niche modelling/predicted species maps: will all taxa eventually be covered, and will the results be (generally) reliable and useable (and at what scale); or, what more needs to be done to increase map quality and reliability.

Opinions, other insights welcome!

Thursday, May 14, 2015

The value of ION to GBIF

This is a quick writeup of an analysis I did to make the case that the list of names held by the Index of Organism Names (ION) (part of Thomson Reuters) would be very useful for GBIF. I must declare a bias, in that I've spent a good chunk of the last 3-4 years exploring the ION database and investigating ways to link the taxonomic names it contains to the primary taxonomic literature, culminating in building BioNames.

What makes ION special is its scope (it endeavours to have all names covered by the ICZN), and that many of its names have associated citation information (i.e., details on the publication that published the name). Like any name database it has duplications and errors, and some of the older content is a bit ropey, but it's a tremendous resource and from my perspective nothing else in zoology comes close.

But rather than rely on anecdote, I decided to do a quick analysis to see what ION could potentially add to GBIF. I've been doing some work on bird names recently, so as an exercise I searched GBIF for holotype specimens for birds. The search (13 May 2015) returned 11,664 records. I then filtered those on taxonomic names that GBIF could not match exactly (TAXON_MATCH_FUZZY) or names that GBIF could only match to a higher rank (TAXON_MATCH_HIGHERRANK). The query URL is:

http://www.gbif.org/occurrence/search?TAXON_KEY=212&TYPE_STATUS=HOLOTYPE&ISSUE=TAXON_MATCH_FUZZY&ISSUE=TAXON_MATCH_HIGHERRANK

This query found 6,392 records, so over half the bird holotype specimens in GBIF do not match a taxonomic name in GBIF. What this means is that GBIF can't accurately place these names in its own taxonomic hierarchy. It also makes it hard to do meaningful analyses of things such as "how long does it take from when a bird specimen is collected to when it is described as a new species?" because if you can't match the name then you can't get the date the name was published.

To explore this further, I downloaded the results of the query (the download has DOI http://doi.org/10.15468/dl.vce3ay). I then wrote a script to parse the specimen records and extract the GBIF occurrence id, catalogue number, and scientific name. I then used the GBIF API to retrieve (where available) the verbatim record for each specimen (using the URL http://api.gbif.org/v1/occurrence/{id}/verbatim, where {id} is the occurrence id). This gives us the original name on the specimen, which I then looked up in BioNames using its API. If I got a hit I extracted the identifier of the name (the LSID in the ION database) and the corresponding publication id in BioNames (if available). If there was a publication associated with the name I then generated a human-readable citation using BioNames's citeproc API. The code for all this is on GitHub.
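The verbatim-lookup step can be sketched roughly as follows. This is not the script referred to above, just a minimal illustration using Python's standard library. It assumes that the verbatim record is a JSON object whose fields are keyed by full Darwin Core term URIs; the sample record is hand-written for the example, and the BioNames lookup is left out because its API is not described here.

```python
import json
import urllib.request

GBIF_API = "http://api.gbif.org/v1/occurrence"

# Assumption: verbatim records key their fields by the full Darwin Core term URI.
SCIENTIFIC_NAME_TERM = "http://rs.tdwg.org/dwc/terms/scientificName"


def verbatim_url(occurrence_id):
    """Build the URL of the verbatim record for one occurrence."""
    return f"{GBIF_API}/{occurrence_id}/verbatim"


def fetch_verbatim(occurrence_id, timeout=30):
    """Fetch the verbatim record as a dict (needs network access)."""
    with urllib.request.urlopen(verbatim_url(occurrence_id), timeout=timeout) as response:
        return json.load(response)


def verbatim_name(record):
    """Return the name as originally supplied with the specimen, or None."""
    name = record.get(SCIENTIFIC_NAME_TERM)
    return name.strip() if name else None


# Hand-written stand-in for a verbatim record, so the example runs offline.
sample_record = {
    "key": 883603238,
    SCIENTIFIC_NAME_TERM: " Porzana severnsi ",
}
print(verbatim_url(sample_record["key"]))
print(verbatim_name(sample_record))
```

In a real run, fetch_verbatim would be called once per occurrence id from the download, and the resulting name passed on to whatever name-matching service is being tested.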

Here's a sample of the mapping:

Occurrence | Holotype | GBIF matched name | Verbatim name | ION / BioNames | Publication
883603238 | USNM PAL378357.3368464 | Porzana Vieillot, 1816 | Porzana severnsi | 8796592c4f3... | Olson, S. L., & James, H. F. (1991). Descriptions of thirty-two new species of birds from the Hawaiian Islands: Part 1. Non-Passeriformes. Ornithological Monographs, 45, 1-88. doi:10.2307/40166794
858732312 | AMNH Skin-245914 | Otus choliba (Vieillot, 1817) | Otus choliba duidae | 4307811b3315... | Chapman, F. M., & History, T. D. E. of the A. M. of N. (1929). Descriptions of new Birds from Mt. Duida, Venezuela. American Museum Novitates, 380, 1-27. Retrieved from http://hdl.handle.net/2246/3988
858732345 | AMNH Skin-245936 | Atlapetes Wagler, 1831 | Atlapetes duidae | 4307791b3315... | Chapman, F. M., & History, T. D. E. of the A. M. of N. (1929). Descriptions of new Birds from Mt. Duida, Venezuela. American Museum Novitates, 380, 1-27. Retrieved from http://hdl.handle.net/2246/3988
858733764 | AMNH Skin-45339 | Leptotila Swainson, 1837 | Leptotila gaumeri Lawr. | |
858744126 | AMNH Skin-218110 | Zosterops Vigors & Horsfield, 1827 | Zosterops alberti ablita | |

The complete result of this mapping can be viewed here. Of the 6,392 holotypes with names not recognised by GBIF, nearly half (3,165, 49.5%) exactly matched a name in ION. Many of these are also linked to the publication that published that name.

So, adding ION helps us find half the missing holotype names. This is before doing anything more sophisticated, such as approximate string matching, resolving synonyms, etc. Hence, I'd argue that the names in ION would add a lot to GBIF's ability to interpret the occurrence records it receives from museums.

I've not had time for further analysis, but at first glance a lot of the missed names are subspecies, there are quite a few fossils, and many names are in the relatively older literature. However there are also some recently described taxa, such as the hawk-owl Ninox rumseyi Rasmussen et al. 2012, and a bunting subspecies from Tristan da Cunha (Nesospiza acunhae fraseri Ryan, 2008) that are missing from GBIF.

Sunday, November 23, 2014

Automatically extracting possible taxonomic synonyms from the literature

Quick notes on an experimental feature I've added to BioNames. It attempts to identify possible taxonomic synonyms by extracting pairs of names with the same species name that appear together on the same page of text. The text could be full text for an open access article, OCR text from BHL, or the title and abstract for an article. For example, the following paper creates a new combination, Hadwenius tursionis, for a parasite of the bottlenose dolphin. This name is a synonym of Synthesium tursionis.

Fernández, M., Balbuena, J. A., & Raga, J. A. (1994, July). Hadwenius tursionis (Marchi, 1873) n. comb. (Digenea, Campulidae) from the bottlenose dolphin Tursiops truncatus (Montagu, 1821) in the western Mediterranean. Syst Parasitol. Springer Science + Business Media. doi:10.1007/bf00009519

The taxonomic position of Synthesium tursionis (Marchi, 1873) (Digenea, Campulidae) is revised, based on material from 147 worms from four bottlenose dolphins Tursiops truncatus stranded off the Comunidad Valenciana (Spanish western Mediterranean). The species is transferred to Hadwenius, as H. tursionis n. comb., and characterised by a high length/width ratio of the body, spinose cirrus and unarmed metraterm. Synthesium, a monotypic genus, becomes a synonym of Hadwenius. The intraspecific variation of some morphological traits is briefly discussed.

If we extract taxonomic names from the title and abstract we have the pair (Synthesium tursionis, Hadwenius tursionis). If we do this across all the text currently in BioNames then we discover other pairs of names that include Synthesium tursionis; joining these together, we can create a graph of co-occurrence of names that are synonyms (see Synthesium tursionis).

[Graph of co-occurring names: Synthesium tursionis, Hadwenius tursionis, Dicrocoelium tursionis, Distomum tursionis, Orthosplanchnus tursionis, Synthesium (Orthosplanchnus) tursionis]
These graphs are computed automatically, and there is inevitably scope for error. Taxa that are not synonyms may have the same specific name (e.g., parasites and hosts may have the same specific name), and some of the names extracted from the text may be erroneous. At the same time, anecdotally it is a useful way to discover links between names. Even better, this approach means that we have the associated evidence for each pair of names. The interface in BioNames lists the references that contain the pairs of names, so you can evaluate the evidence for synonymy. It would be useful to try and evaluate the automatically detected synonyms by comparisons with existing lists of synonyms (e.g., from GBIF).
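The pairing idea can be sketched in a few lines of Python. This is a simplified illustration, not the code behind BioNames: it assumes the names on each page have already been extracted by some name-finding service, treats the first word of a name as the genus and the last word as the specific epithet, and uses made-up page contents built from the names mentioned above.

```python
from collections import defaultdict
from itertools import combinations


def split_name(name):
    """Return (genus, epithet), taking the first and last words of the name."""
    parts = name.split()
    return parts[0], parts[-1]


def candidate_pairs(names_on_page):
    """Pairs of names on one page with the same epithet but different genera."""
    names_by_epithet = defaultdict(set)
    for name in names_on_page:
        if len(name.split()) < 2:
            continue  # skip genus-only names
        names_by_epithet[split_name(name)[1]].add(name)
    pairs = set()
    for names in names_by_epithet.values():
        for a, b in combinations(sorted(names), 2):
            if split_name(a)[0] != split_name(b)[0]:
                pairs.add((a, b))
    return pairs


def build_graph(pages):
    """Join the pairs from every page into an undirected co-occurrence graph."""
    graph = defaultdict(set)
    for names_on_page in pages:
        for a, b in candidate_pairs(names_on_page):
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Made-up page contents for illustration only.
pages = [
    ["Synthesium tursionis", "Hadwenius tursionis", "Tursiops truncatus"],
    ["Synthesium tursionis", "Orthosplanchnus tursionis"],
]
graph = build_graph(pages)
print(sorted(graph["Synthesium tursionis"]))
```

Keeping track of which page produced each pair (not done here) is what would let an interface show the evidence for each suggested synonym.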

Thursday, August 28, 2014

BioNames database can be downloaded

My BioNames project has been going for over a year now, but I hadn't gotten around to providing bulk access to the data I've been collecting and cleaning. I've gone some way towards fixing this. You can now grab a snapshot of the BioNames database as a Darwin Core Archive here. This snapshot was generated on the 22nd August, so it is already a little out of date (BioNames is edited almost daily as I clean and annotate it when I should be doing other things).

The data dump doesn't capture all the information in BioNames as I've tried to keep it simple, and Darwin Core is a bit of a pain to deal with. The actual database is in CouchDB, which is (mostly) an absolute joy to work with. I replicate the database to Cloudant, which means there's a copy "in the cloud". A number of my other CouchDB projects are also in Cloudant; in the case of Australian Faunal Directory and BOL DNA Barcode Map the data is also served directly from Cloudant.

Monday, June 02, 2014

BioNames one year on

It is almost a year to the day that I released BioNames, a database of "taxa, texts, and trees". This project was my entry in EOL's Computable Data Challenge. Since it went live (after much late night programming by myself and Ryan Schenk) I've been tweaking the interface, cleaning (so much cleaning), and adding data (mostly DOIs, links to BioStor, and PDFs). I also wrote a paper describing the project, published in PeerJ (http://dx.doi.org/10.7717/peerj.190).

Why BioNames?


I'm building BioNames to scratch a very specific itch. To me it is a source of enormous frustration that one of the most basic questions we can ask about a name (where was it first published?) is difficult to answer using current taxonomic databases. And if there is an answer, it is usually given as a text string describing the publication (i.e., a literature citation) rather than an identifier such as a DOI that enables me to (a) go to the publication, (b) refer to the publication in a database in an unambiguous way, and (c) discover further information about that publication by querying services that recognise that identifier.

There are enormous digitisation efforts underway by commercial publishers, digital archives, and libraries, and all of this is putting more and more literature online. This is the primary evidence base for taxonomy: it is where new names are published, taxa are described, and hypotheses of synonymy and relationship are proposed, and we should be actively linking to it. Of course, there are some projects that do this, but these are typically restricted in taxonomic or geographic scope. I want all this information together in one place. Hence, BioNames.

Of course, I could wait until projects like ZooBank have all the animal names, but as I pointed out in Why the ICZN is in trouble, the ICZN and ZooBank have only a tiny fraction of the published names:
[Figure: ICZN]
This renders ZooBank barely usable for my purposes. There are millions of animal names in circulation, and our inability to discover much about them leads to all sorts of headaches, such as the errors in GBIF that I've mentioned earlier on this blog. I want a tool that can help me interpret those errors, and I want it now, hence BioNames.

What is in BioNames?


The original data comes from the LSID metadata served by ION. At the moment BioNames has 4,880,925 names, 1,549,152 of which are linked to a bibliographic citation. The bulk of the time I spend on BioNames consists of cleaning and clustering these citations, and linking them to digital identifiers.

To get some insight into what is left to be done I created a CSV dump of the publication data underlying BioNames, and loaded it into Google's Cloud Storage (http://storage.googleapis.com/ion-names/names3.csv). I then used Google's BigQuery to write some simple SQL queries. You can find more details here: https://github.com/rdmpage/bionames-bigquery.

Here is a summary table of the number of names that are published in an article with one of the identifiers that I track. These include DOIs, PMIDs, as well as whether the article is in BioStor, has a URL (typically to a publisher's web site), or a PDF.
Identifier | Number of names
DOI | 196,915
BioStor | 130,792
JSTOR | 23,483
CiNii | 11,296
PMID | 8,886
URL | 72,754
PDF | 161,474
(any) | 489,029


The final row is the number of names published in articles that have at least one identifier (some articles have multiple identifiers, such as a DOI and a link to BioStor). Given that there are approximately 1.5 million names with bibliographic citations, and around 490,000 have an identifier, the user has a 30% chance of finding the original description for an animal name picked at random. Obviously, BioNames has gaps (ION has missed a number of names, and/or publications), the taxonomic coverage of bibliographic identifiers is uneven (depending on the publications chosen by taxonomists to publish in, and the level of digitisation of those publications), and there is still a lot of data cleaning to do. But an almost 1 in 3 chance of finding something useful for a name seems a reasonable level of progress.
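For anyone wanting to reproduce this kind of summary without BigQuery, here is a minimal Python sketch of the counting logic. The column names and the three sample rows are hypothetical; the real CSV dump may well be laid out differently.

```python
import csv
import io

# Hypothetical column names; the real dump may use different ones.
IDENTIFIER_COLUMNS = ["doi", "biostor", "jstor", "cinii", "pmid", "url", "pdf"]


def summarise(rows):
    """Count names per identifier type, plus names with at least one identifier."""
    counts = {column: 0 for column in IDENTIFIER_COLUMNS}
    counts["any"] = 0
    total = 0
    for row in rows:
        total += 1
        present = [c for c in IDENTIFIER_COLUMNS if (row.get(c) or "").strip()]
        for column in present:
            counts[column] += 1
        if present:
            counts["any"] += 1
    return total, counts


# Three made-up rows: one with a DOI, one with BioStor and a PDF, one with nothing.
SAMPLE = """name,doi,biostor,jstor,cinii,pmid,url,pdf
Aus bus,10.1234/example,,,,,,
Cus dus,,12345,,,,,http://example.org/a.pdf
Eus fus,,,,,,,
"""

total, counts = summarise(csv.DictReader(io.StringIO(SAMPLE)))
print(counts)
print(counts["any"] / total)
```

Applied to the figures in the table, the same ratio is 489,029 / 1,549,152, which is a little over 30%.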

Out of interest I created some quick and dirty charts in Excel for different categories of identifier. Here, for example, is the percentage of names published each year that are linked to a publication with a DOI:
[Chart: percentage of names published each year that are linked to a publication with a DOI]
Over 80% of names published in 2013 were in an article with a DOI, so we are fast heading to a situation where modern zoological taxonomy is fully part of the citation graph of science. Much of this spike in 2013 is due to the adoption of DOIs by Zootaxa, which is far and away the dominant journal in animal taxonomy.

Here is the same chart for publications in BioStor.
[Chart: percentage of names published each year that are linked to a publication in BioStor]
The big spike at the start is for names where the year of publication is missing. Leaving that aside, we can see the impact of the 1923 copyright cut-off in the US, which puts a big dent in the Biodiversity Heritage Library's digitisation efforts. Note, however, that BHL has a lot of post-1923 content.


Does anyone use BioNames?



I use BioNames almost every day, and have devoted way more time than is healthy to populating it. As I explore issues like the quality of the taxonomy in GBIF, I find it useful to see the original description of a taxon, and its fate in subsequent revisions. In the early days I'd spend more time adding missing papers to help answer a question, but increasingly I'm finding that the content is already there. So, I find it useful, but what (gulp) if I'm the only one?

Below is the number of "sessions" per day since BioNames was launched (data from Google Analytics for May 1st, 2013 to May 31st, 2014). After an initial flurry of interest, web traffic pretty quickly died off. Since then it's been slowly gaining more visitors, and then (for reasons which escape me) it started getting a lot more traffic from April onwards:
[Chart: BioNames sessions per day]
To give these numbers some context, for the same period BioStor (my archive of articles from BHL) had the following traffic:
[Chart: BioStor sessions per day]
Note the different scales: BioStor is getting around 500 sessions a day during week days, while BioNames gets around 200. By way of comparison, GBIF gets up to 4000 sessions a day, and this blog typically has 50-100 sessions per day.

Where next?


There are a couple of directions for the future. There is still a lot of data cleaning and linking to do. Last year I did a quick analysis of which taxonomic journals should be digitised next. I've updated this by creating a spreadsheet that ranks the journals in BioNames by the number of names each has published, and each is coloured by the fraction of those names for which I've found a digital identifier for the paper in which they are published. This table is incomplete, and reflects not only the extent of digitisation, but also the extent to which I've managed to locate the journals online. But it is a starting point for thinking about what journals to prioritise for digitisation, or if they are already digitised, journals that I need to target for addition to BioNames. The spreadsheet is available as a Google sheet.

Another direction is data mining. In addition to the obvious task, namely locating and indexing taxonomic names, there are other things to be done. In BioStor I extract geographic point localities and specimen codes from the OCR text. These could be indexed to enable geographic or specimen-based searching. The same approach could be generalised to the literature in BioNames, so that we could track the mentions of a particular specimen, or retrieve lists of publications about a specific locality (e.g., all taxonomic papers that refer to a particular mountain range, deep sea vent, or island).
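To give a flavour of what specimen-based indexing might look like, here is a deliberately naive Python sketch. The regular expression (an institutional acronym followed by a catalogue number) and the page texts are invented for illustration; this is not the extraction code used in BioStor, and real specimen codes are far more varied.

```python
import re
from collections import defaultdict

# Naive pattern: 2-6 capital letters, an optional space or hyphen, then 3+ digits.
SPECIMEN_CODE = re.compile(r"\b([A-Z]{2,6})[ -]?(\d{3,})\b")


def specimen_codes(text):
    """Return specimen codes found in the text, normalised as 'ACRONYM number'."""
    return [f"{acronym} {number}" for acronym, number in SPECIMEN_CODE.findall(text)]


def index_pages(pages):
    """Map each specimen code to the set of page ids that mention it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for code in specimen_codes(text):
            index[code].add(page_id)
    return index


# Invented page texts for illustration only.
pages = {
    "page-1": "Holotype USNM 378357, paratype AMNH 245914.",
    "page-2": "We re-examined AMNH 245914 from Mt. Duida.",
}
index = index_pages(pages)
print(sorted(index["AMNH 245914"]))
```

An index like this is what would make it possible to ask for every page that mentions a given specimen.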

BioNames also does some limited analysis of taxonomic name co-occurrence, for example suggesting that species names with the same specific epithet but different generic names are possible synonyms if they occur on the same page. There is a lot of scope for expanding this. I'm also keen to explore citation indexing, that is, extracting lists of literature cited from articles in BioNames, and linking those to the corresponding record in BioNames. Ultimately I want to be able to navigate through the taxonomic literature along these citation links, so that we can trace the fate of names through time.

But this is still only a start. Papers such as Seltmann et al. illustrate other things that are possible once we have a large corpus of taxonomic literature available:

Seltmann, K. C., Pénzes, Z., Yoder, M. J., Bertone, M. A., & Deans, A. R. (2013, February 18). Utilizing Descriptive Statements from the Biodiversity Heritage Library to Expand the Hymenoptera Anatomy Ontology. (C. S. Moreau, Ed.) PLoS ONE. Public Library of Science (PLoS). doi:10.1371/journal.pone.0055674


So, a lot still to be done. I hope to have achieved some of this if and when I write a follow up post on the status of BioNames in a year's time.


Wednesday, January 15, 2014

What I'll be working on in 2014: knowledge graphs and Google forests

More for my own benefit than anything else I've decided to list some of the things I plan to work on this year. If nothing else, it may make sobering reading this time next year.

A knowledge graph for biodiversity


Google's introduction of the "knowledge graph" gives us a happy phrase to use when talking about linking stuff together. It doesn't come with all the baggage of the "semantic web", or the ambiguity of "knowledge base". The diagram below is my mental model of the biodiversity knowledge graph (this comes from http://dx.doi.org/10.7717/peerj.190, but I sketched most of this for my Elsevier Challenge entry in 2008, see http://dx.doi.org/10.1038/npre.2008.2579.1).

[Figure 1: diagram of the biodiversity knowledge graph]

Parts of this knowledge graph are familiar: articles are published in journals, and have authors. Articles cite other articles (represented by a loop in the diagram below). The topology of this graph gives us citation counts (number of times an article has been cited), impact factor (citations for articles in a given journal), and author-based measures such as the H-index (a function of the distribution of citations for each article you have authored). Beyond simple metrics this graph also gives us the means to track the provenance of an idea (by following the citation trail).
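To make the "metrics fall out of the graph" point concrete, here is a small Python sketch that derives citation counts and an H-index from a list of citation links. The article identifiers and links are invented for the example.

```python
from collections import Counter


def citation_counts(citations):
    """Count how often each article is cited, given (citing, cited) pairs."""
    return Counter(cited for _citing, cited in citations)


def h_index(counts):
    """Largest h such that h articles each have at least h citations."""
    ranked = sorted(counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank
        else:
            break
    return h


# Invented citation links: (citing article, cited article).
links = [("b", "a"), ("c", "a"), ("d", "a"), ("c", "b"), ("d", "b"), ("d", "c")]
counts = citation_counts(links)
print(counts["a"], h_index(counts.values()))
```

Here article "a" is cited three times, "b" twice and "c" once, so an author of all three would have an H-index of 2.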

[Figure: the publication part of the knowledge graph]

The next step is to grow this graph to include the other things we care about (e.g., taxa, taxon names, specimens, sequences, phylogenies, localities, etc.).

BioNames


I spent a good deal of last year building BioNames (for background see my blog posts or read the paper in PeerJ http://dx.doi.org/10.7717/peerj.190). BioNames represents a small corner of the biodiversity knowledge graph, namely taxonomic names and their associated publications (with added chocolatey goodness of links to taxon concepts and phylogenies). In 2014 I'll continue to clean this data (I seem to be forever cleaning data). So far BioNames is restricted to animal names, but now that the plant folks have relaxed their previously restrictive licensing of plant data (see post on TAXACOM) I'm looking at adding the million or so plant names (once I've linked as many as possible to digital identifiers for the corresponding publications).

Spatial indexing


Now that I've become more involved in GBIF I'm spending more time thinking about spatial indexing, and our ability to find biodiversity data on a map. There's a great Google ad that appeared on UK TV late last year. In it, Julian Bayliss recounts the use of Google Earth to discover virgin rainforest (the "Google forest") on Mount Mabu in Mozambique.



It's a great story, but I keep looking at this and wondering "how did we know that we didn't know anything about Mount Mabu?" In other words, can we go to any part of the world and see what we know about that area? GBIF goes a little way there with its specimen distribution maps, which give some idea of what is now known from Mount Mabu (although the map layers used by GBIF are terrible compared to what Google offers).

[Map: GBIF specimen records around Mount Mabu]

But I want to be able to see all the specimens now known from this region (including the new species that have been discovered, e.g. see http://dx.doi.org/10.1007/s12225-011-9277-9 and http://dx.doi.org/10.1080/21564574.2010.516275). Why can't I have a list of publications relevant to this area (e.g., species descriptions, range extensions, ecological studies, conservation reports)? What about DNA sequences from material in this region (e.g., from organismal samples, DNA barcodes, metagenomics, etc.)? If GBIF is to truly be a "Global Biodiversity Information Facility" then I want it to be able to provide me with a lot more information than it currently does. The challenge is how to enable that to happen.

Tuesday, October 08, 2013

Which taxonomic journals should be digitised next?

One reason I was able to build BioNames is that a significant fraction of the taxonomic literature for animals is now online, either due to the efforts of the Biodiversity Heritage Library, digital archives, commercial publishers, or individual institutions and scientific societies. However there are still big gaps in literature availability. To get a sense of these gaps I've constructed a table listing all the journals in BioNames that have an ISSN, ordered by the number of articles in BioNames (i.e., mostly articles that publish new names). The full table is here; I've reproduced part of it below (limited to those journals with at least 500 articles in BioNames). If you click on the ISSN in the table you can go to the corresponding page in BioNames to get full details of what BioNames currently knows about that journal.

The journals in red are the ones with the worst online presence (see complete key below). Note that BioNames is still a work in progress, so there will be some journals that are online but that I've simply not had a chance to add to BioNames. With that in mind, there are some striking gaps in the digital availability of taxonomic publications. Several Russian journals (collectively publishing thousands of articles) are not online (the story here is somewhat complicated because some Russian journals also have English-language translations available, but these mostly cover recent articles). A number of large entomological journals are not available either, which matters given that most described animal taxa are insects.

We can think of this as a "league table" of literature availability. My hope is that digitising projects such as the Biodiversity Heritage Library will look at this and use it to help prioritise which journals to scan. In particular, if the journal is not pre-1923 (pre-1923 publications being out of US copyright), I hope BHL will contact the journal's publisher and see if they would be willing to add their journal to those (such as Proceedings of the Biological Society of Washington) that have opened up their complete back catalogue to being scanned by BHL.

I also hope that scientific societies or organisations that publish journals in the "red" or "orange" zones will consider digitising their journals and making their contents accessible to the wider community. We are reaching the point where if knowledge is not online then it effectively doesn't exist.

> 90%: Almost all are available
< 90%: Most are available
< 50%: Limited availability
< 10%: Mostly inaccessible

ISSN | Journal | Articles | Digitised | % digitised
1175-5326 | Zootaxa | 8581 | 8189 | 95
0374-5481 | The Annals and magazine of natural history | 4463 | 3502 | 78
1000-0739 | Dong wu fen lei xue bao. Acta zootaxonomica Sinica | 3403 | 2450 | 72
0006-324X | Proceedings of the Biological Society of Washington | 3384 | 3263 | 96
0022-3360 | Journal of paleontology | 3373 | 3121 | 93
0037-928X | Bulletin de la Société entomologique de France | 3012 | 244 | 8
0013-8797 | Proceedings of the Entomological Society of Washington | 2972 | 2805 | 94
0044-5134 | Zoologicheskiĭ zhurnal | 2812 | 16 | 1
0044-5231 | Zoologischer Anzeiger | 2761 | 594 | 22
0022-3395 | The Journal of parasitology | 2353 | 2222 | 94
0008-347X | The Canadian entomologist | 2260 | 2059 | 91
0003-0082 | American Museum novitates | 1942 | 1814 | 93
0035-418X | Revue suisse de zoologie | 1851 | 1581 | 85
0022-2933 | Journal of natural history | 1848 | 1823 | 99
0367-1445 | Entomologicheskoe obozrenie | 1803 | 3 | 0
0096-3801 | Proceedings of the United States National Museum | 1722 | 1365 | 79
0013-872X | Entomological news | 1691 | 1619 | 96
0370-2774 | Proceedings of the Zoological Society of London | 1580 | 1008 | 64
1000-7482 | Kun chong fen lei xue bao = Entomotaxonomia | 1518 | 1127 | 74
0037-9271 | Annales de la Société entomologique de France | 1497 | 757 | 51
0031-031X | Paleontologicheskiĭ zhurnal | 1472 | 31 | 2
0013-8746 | Annals of the Entomological Society of America | 1441 | 1383 | 96
0035-1814 | Revue de zoologie et de botanique africaines | 1400 | 47 | 3
0031-0603 | The Pan-Pacific entomologist | 1389 | 56 | 4
0323-6145 | Berliner entomologische Zeitschrift / herausgegeben von dem Entomologischen Vereine in Berlin | 1342 | 710 | 53
1148-8425 | Bulletin du Muséum National d'Histoire Naturelle réunion mensuelle des naturalistes du Muséum | 1303 | 506 | 39
0013-8908 | The Entomologist's monthly magazine | 1268 | 6 | 0
0044-586X | Acarologia | 1226 | 87 | 7
0045-8511 | Copeia | 1191 | 1095 | 92
0031-0239 | Palaeontology | 1185 | 1154 | 97
0001-6616 | Gu sheng wu xue bao = Acta palaeontologica Sinica | 1127 | 0 | 0
0165-5752 | Systematic parasitology | 1082 | 1028 | 95
0454-6296 | Kun chong xue bao = Acta entomologica Sinica / Zhongguo kun chong xue hui bian ji | 1054 | 902 | 86
0024-0672 | Zoologische mededeelingen / uitgegeven vanwege 's Rijksmuseum van Natuurlijke Historie te Leiden | 1039 | 997 | 96
0370-047X | Proceedings of the Linnean Society of New South Wales | 1038 | 742 | 71
0030-5316 | Oriental insects | 1035 | 916 | 89
0028-7199 | Journal of the New York Entomological Society | 1013 | 860 | 85
0521-4726 | Annales historico-naturales Musei Nationalis Hungarici = Természettudományi Múzeum évkönyve | 1007 | 886 | 88
0070-7279 | Reichenbachia / Staatliches Museum für Tierkunde in Dresden | 951 | 2 | 0
0022-8567 | Journal of the Kansas Entomological Society | 945 | 906 | 96
0373-3491 | Bollettino della Società entomologica italiana | 940 | 14 | 1
0037-2102 | Senckenbergiana biologica | 939 | 11 | 1
0002-8320 | Transactions of the American Entomological Society | 923 | 796 | 86
0374-9797 | Nouvelle revue d'entomologie | 923 | 1 | 0
0774-2819 | Lambillionea | 918 | 0 | 0
0034-7108 | Revista Brasileira de biologia | 916 | 6 | 1
0007-1595 | Bulletin of the British Ornithologists' Club | 911 | 459 | 50
0013-8843 | Entomologische Zeitschrift | 881 | 4 | 0
0253-116X | Linzer biologische Beiträge / Oberösterreiches Landesmuseum | 876 | 503 | 57
0272-4634 | Journal of vertebrate paleontology | 869 | 864 | 99
1217-8837 | Acta zoologica Academiae Scientiarum Hungaricae | 868 | 134 | 15
0011-216X | Crustaceana | 865 | 865 | 100
0085-5626 | Revista brasileira de entomologia | 863 | 260 | 30
0365-4389 | Annali del Museo civico di storia naturale "Giacomo Doria." | 855 | 503 | 59
0097-3157 | Proceedings of the Academy of Natural Sciences of Philadelphia | 848 | 500 | 59
0010-065X | The Coleopterists' bulletin | 831 | 804 | 97
1313-2989 | ZooKeys | 827 | 827 | 100
0024-4082 | Zoological journal of the Linnean Society | 823 | 821 | 100
0008-4301 | Canadian journal of zoology | 817 | 803 | 98
0028-1344 | The Nautilus | 814 | 501 | 62
0040-7496 | Tijdschrift voor entomologie | 804 | 580 | 72
0375-0434 | Proceedings of the Royal Entomological Society of London. Series B, Taxonomy | 796 | 783 | 98
0033-2615 | Psyche | 796 | 709 | 89
0164-7954 | International journal of acarology | 787 | 786 | 100
0003-0090 | Bulletin of the American Museum of Natural History | 776 | 488 | 63
0037-962X | Bulletin de la Société zoologique de France | 765 | 228 | 30
0181-0863 | Revue française d'entomologie | 765 | 6 | 1
1562-0891 | Wiener Entomologische Zeitung | 752 | 573 | 76
1000-3118 | Gu ji zhui dong wu xue bao | 743 | 4 | 1
0003-0023 | Transactions of the American Microscopical Society | 731 | 728 | 100
0075-6547 | Koleopterologische Rundschau / herausgegeben von der Zoologisch-Botanischen Gesellschaft gemeinsam mit der Forstlichen Bundesversuchsanstalt | 706 | 339 | 48
0286-9810 | The entomological review of Japan = Konchūgaku hyōron | 704 | 98 | 14
0867-1710 | Genus | 690 | 2 | 0
0042-3580 | Venus : Japanese journal of malacology = Kairuigaku zasshi | 687 | 531 | 77
0067-1975 | Records of the Australian Museum | 679 | 629 | 93
0006-6982 | The Journal of the Bombay Natural History Society | 677 | 81 | 12
0320-9180 | Zoosystematica rossica | 676 | 6 | 1
0084-5604 | Vestnik zoologii / Akademii︠a︡ nauk Ukrainskoĭ SSR, Institut zoologii | 672 | 37 | 6
0387-5733 | Elytra | 666 | 108 | 16
0043-0439 | Journal of the Washington Academy of Sciences | 664 | 603 | 91
0003-4541 | Annales zoologici / Polska Akademia Nauk, Instytut Zoologiczny | 661 | 336 | 51
0016-6995 | Geobios | 659 | 475 | 72
0004-2110 | Arkiv för zoologi / utgivet af K. Svenska vetenskaps-akademien | 658 | 59 | 9
0035-8894 | Transactions of the Royal Entomological Society of London | 655 | 495 | 76
0915-5805 | Japanese journal of entomology | 645 | 620 | 96
0013-8878 | The Entomologist | 645 | 14 | 2
0031-1820 | Parasitology | 641 | 614 | 96
0007-4853 | Bulletin of entomological research | 633 | 611 | 97
0375-099X | Records of the Indian Museum a journal of Indian zoology ed. by the Director, Zoological Survey of India | 630 | 213 | 34
1326-6756 | Australian journal of entomology | 629 | 629 | 100
0018-8158 | Hydrobiologia | 627 | 627 | 100
0013-8770 | Konchū = Kontyū | 625 | 616 | 99
0217-2445 | The Raffles bulletin of zoology | 622 | 571 | 92
0372-1426 | Transactions of the Royal Society of South Australia, Incorporated | 622 | 450 | 72
0079-8835 | Memoirs of the Queensland Museum | 620 | 373 | 60
0003-4150 | Annales de parasitologie humaine et comparée | 612 | 355 | 58
0018-0130 | Proceedings of the Helminthological Society of Washington | 604 | 588 | 97
0015-4040 | The Florida entomologist | 602 | 601 | 100
0077-7749 | Neues Jahrbuch für Geologie und Paläontologie. Abhandlungen | 602 | 146 | 24
1066-5234 | The journal of eukaryotic microbiology | 601 | 572 | 95
0031-0220 | Paläontologische Zeitschrift | 601 | 58 | 10
0567-7920 | Acta palaeontologica Polonica | 599 | 578 | 96
0032-3780 | Polskie pismo entomologiczne. Bulletin entomologique de Pologne | 590 | 28 | 5
0027-4100 | Bulletin of the Museum of Comparative Zoology at Harvard College | 581 | 444 | 76
0042-3211 | The Veliger | 578 | 274 | 47
0181-0626 | Bulletin du Muséum national d'histoire naturelle. Section A, Zoologie, biologie et écologie animales | 574 | 564 | 98
0068-547X | Proceedings of the California Academy of Sciences | 573 | 261 | 46
0035-6387 | Rivista di parassitologia | 566 | 2 | 0
0003-5092 | Annotationes zoologicae Japonenses / auspiciis Societatis Zoologicae Tokyonensis seriatim editae = Nihon dōbutsugaku ihō | 562 | 545 | 97
0036-7575 | Mitteilungen der Schweizerischen entomologischen Gesellschaft = Bulletin de la Société entomologique suisse | 562 | 3 | 1
0251-074X | Revue de zoologie africaine | 560 | 18 | 3
0373-9465 | Folia entomologica Hungarica = Rovartani közlemények | 555 | 6 | 1
0206-0477 | Trudy Zoologicheskogo instituta = Travaux de l'Institut zoologique de l'Académie des sciences de l'URSS / Akademii︠a︡ nauk Soi︠u︡za Sovetskikh Sot︠s︡ialisticheskikh Respublik | 554 | 2 | 0
1445-5226 | Invertebrate systematics | 550 | 550 | 100
0026-2803 | Micropaleontology | 548 | 409 | 75
0307-6970 | Systematic entomology | 537 | 526 | 98
0020-1804 | Insecta matsumurana | 536 | 514 | 96
0278-0372 | Journal of crustacean biology : a quarterly of the Crustacean Society for the publication of research on any aspect of the biology of crustacea | 531 | 531 | 100
0165-0424 | Aquatic insects | 525 | 525 | 100
1051-8932 | Bulletin of the Brooklyn Entomological Society | 523 | 3 | 1
0013-8711 | Entomologica scandinavica | 522 | 513 | 98
0341-8391 | Spixiana | 515 | 465 | 90
0013-8789 | Journal of the Entomological Society of Southern Africa | 515 | 392 | 76
0018-0831 | Herpetologica | 514 | 472 | 92
0323-7087 | Zoologische Jahrbücher. Abteilung für Systematik, Geographie und Biologie der Tiere | 513 | 176 | 34
0007-4977 | Bulletin of marine science | 510 | 397 | 78
0250-4413 | Entomofauna | 500 | 387 | 77
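
For anyone wanting to redo this ranking from their own article counts, here is a minimal Python sketch of how the "% digitised" column and the key categories could be derived. It is not the code behind BioNames: the function names are hypothetical, rounding half up is my assumption (it does reproduce the percentages for the three sample rows copied from the table), and the key doesn't say which category a value of exactly 90% belongs to, so the sketch puts it in "Most are available".

```python
def percent_digitised(articles: int, digitised: int) -> int:
    """Whole-number percentage of articles digitised (assumed: round half up)."""
    if articles <= 0:
        raise ValueError("articles must be a positive count")
    # Integer arithmetic avoids floating-point surprises at .5 boundaries.
    return (200 * digitised + articles) // (2 * articles)


def availability(percent: int) -> str:
    """Map a percentage onto the four labels used in the key.

    Assumption: exactly 90% counts as "Most are available",
    because the key reserves the top category for "> 90%".
    """
    if percent > 90:
        return "Almost all are available"
    if percent >= 50:
        return "Most are available"
    if percent >= 10:
        return "Limited availability"
    return "Mostly inaccessible"


# Three rows copied from the table: (journal, articles, digitised).
rows = [
    ("Zootaxa", 8581, 8189),
    ("Zoologischer Anzeiger", 2761, 594),
    ("Bulletin de la Société entomologique de France", 3012, 244),
]

for journal, articles, digitised in rows:
    pct = percent_digitised(articles, digitised)
    print(f"{journal}: {pct}% ({availability(pct)})")
```

Sorting the rows by percentage instead of article count would put the journals most in need of scanning at the top, which may be the more useful view for a digitisation project.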