@rdmpage but ~700,000 papers were published in 2009. Were there even 70,000 published in 1920? 2000-2012 contains *a lot*Ross was wondering whether we should invest much effort in extracting information from legacy literature, suggesting that this literature was of most interest to taxonomists, whereas other biologists will be more likely to find what they want from ever growing recent literature. I was arguing that because many taxa are poorly studied, the chances that you will find data on your organism in the recent literature is likely to be low, unless you study an economically or medically important taxon, or a model organism (many of which fit first categories). My view is based on papers such as Bob May's 1988 paper:
— Ross Mounce (@rmounce) February 13, 2013
MAY, R. M. (1988). How Many Species Are There on Earth? Science, 241(4872), 1441-1449. doi:10.1126/science.241.4872.1441In table 3 May lists the average number of papers per species in the period 1978-1987 across various taxonomic groups. Mammals averaged 1.8 papers per species, beetles averaged 0.01. This means that if you study a beetle species you have a 1/100 chance (on average) of finding a paper on your species in any given year (assuming all beetles are equal, which is clearly false). At this point perhaps we should define "legacy literature". In many ways the issue is not so much the age of the literature, but whether the literature was "born digital", that is, whether from it's authoring to publication the document has been in digital form, so the output is in a format (e.g., HTML, XML, or PDF that contains the document text) from which we can readily extract and mine the text. In contrast, documents that have been digitised from a physical medium (e.g., scans of pages) are less tractable because the text has to be extracted by OCR, and error-prone process. Given these errors is the effort worth it. At this point I should say that BHL is not using the best OCR technology available (my own experience suggests that ABBYY Online is much better), and our community is not making use of research on automating OCR correction). But the question is worth asking. In an effort to answer it, I've done a quick analysis of the PanTHERIA database:
Jones, K. E., Bielby, J., Cardillo, M., Fritz, S. A., O Dell, J., Orme, C. D. L., & Purvis, A. (2009). PanTHERIA: a species-level database of life history, ecology, and geography of extant and recently extinct mammals. (W. K. Michener, Ed.)Ecology, 90(9), 2648-2648. doi:10.1890/08-1494.1PanTHERIA is a database assembled by Kate Jones (@ProfKateJones) and colleagues for comparative biologists (not taxonomists), and collects fundamental biological data about the best studied animal group on the planet (see May's paper above). In the metadata for the database there is a list of the 3143 publications they consulted to populate the database. Below is a table showing the distribution of the year in which these publications appeared:
I've put the articles cited as data sources by the PanTHERIA database in a Mendeley group.