iPhylo: Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my <a href="https://twitter.com/rdmpage">Twitter feed</a>.<br>ISSN 2051-8188. Written content on this site is licensed under a <a href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International license</a>. Roderic Page.
<h1>Problems with the DataCite Data Citation Corpus</h1>
<p>DataCite have released the <a href="https://makedatacount.org/data-citation">Data Citation Corpus</a>, together with a <a href="http://corpus.datacite.org/dashboard">dashboard</a> that summarises the corpus. This is billed as:</p>
<blockquote>
<p>A trusted central aggregate of all data citations to further our <strong>understanding of data usage</strong> and advance meaningful data metrics</p>
</blockquote>
<p>The goal is to build a database of citations linking scholarly articles to data, such as datasets in repositories, sequences in GenBank, protein structures in PDB, etc. Access to the corpus can be obtained by submitting a form, then having a (very pleasant) conversation with DataCite about the nature of the corpus. This process introduces unnecessary friction. If you want people to explore this, why not make it a simple download?</p>
<p>I downloaded the corpus, which is nearly 7 GB of JSON, formatted as an array(!), thankfully with one citation per line so it is reasonably easy to parse. (<a href="https://jsonlines.org">JSON Lines</a> would be more convenient.)</p>
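<p>Because the file is one giant JSON array with one object per line, it can be streamed without loading 7 GB into memory. A minimal sketch, assuming the opening and closing brackets sit on their own lines and each object may carry a trailing comma:</p>

```python
import json

def iter_citations(path):
    """Yield one citation dict per line from a JSON array written
    with one object per line (as in the corpus download)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            # skip the enclosing "[" and "]" of the array, and blank lines
            if line in ("[", "]", ""):
                continue
            yield json.loads(line)
```

<p>This avoids a full <code>json.load()</code> of the array, which would need several times the file size in RAM.</p>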
<p>I loaded this into a SQLite database to make it easier to query, and I have some thoughts. Before outlining why I think the corpus has serious problems, I should emphasise that I’m a big fan of what DataCite are trying to do. Being able to track data usage to give credit to researchers and repositories (citations to data as well as papers), to track provenance of data (e.g., when a GenBank sequence turns out to be wrong, being able to find all the studies that used it), and to find additional links between papers beyond bibliographic links (e.g., when data is cited but not the original publication) are all good things. Obviously, lots of people have talked about this, but this is my blog so I’ll cite myself as an example 😉.</p>
<blockquote>
<p>Page, R. Visualising a scientific article. <em>Nat Prec</em> (2008). <a href="https://doi.org/10.1038/npre.2008.2579.1">https://doi.org/10.1038/npre.2008.2579.1</a></p>
</blockquote>
<p>My main interest in the corpus is tracking citations of DNA sequences, which are often not linked to even the original publication in GenBank. I was hopeful the corpus could help in this work.</p>
<p>Ok, let’s now look at the actual corpus.</p>
<h2 id="data-structure">Data structure</h2>
<p>Each citation comprises a JSON object, with a mix of external identifiers such as DOIs, and internal identifiers as UUIDs. The latter are numerous, and make the data file much bigger than it needs to be. For example, there are two sources of citation data, DataCite and the Chan Zuckerberg Initiative. These have <code>sourceId</code> values of <code>3644e65a-1696-4cdf-9868-64e7539598d2</code> and <code>c66aafc0-cfd6-4bce-9235-661a4a7c6126</code>, respectively. There are a little over 10 million citations in the corpus, so that’s a lot of bytes that could simply have been <code>1</code> or <code>2</code>.</p>
<p>More frustrating than the wasted space is the lack of any list of what each UUID means. I figured out that <code>3644e65a-1696-4cdf-9868-64e7539598d2</code> is DataCite only by looking at the data, knowing that CZI had contributed more records than DataCite. For other entities such as repositories and publishers, one has to go spelunking in the data to make reasonable guesses as to what the repositories are. Given that most citations seem to be to biomedical entities, why not use something such as the compact identifiers from <a href="https://identifiers.org">Identifiers.org</a> for each repository?</p>
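<p>Until DataCite publishes a lookup table, the only option is to hard-code the mappings one can infer from the data. A sketch of the kind of table I ended up with when loading into SQLite; the two UUIDs are the <code>sourceId</code> values quoted above, and the labels and integer codes are my own:</p>

```python
# Map the corpus's sourceId UUIDs (inferred from the data, since no
# official lookup table exists) to short labels and small integer codes.
SOURCE_IDS = {
    "3644e65a-1696-4cdf-9868-64e7539598d2": ("datacite", 1),
    "c66aafc0-cfd6-4bce-9235-661a4a7c6126": ("czi", 2),
}

def compact_source(source_id):
    """Return (label, code) for a known sourceId, or ('unknown', 0)."""
    return SOURCE_IDS.get(source_id, ("unknown", 0))
```

<p>Storing the one-byte code rather than the 36-character UUID is exactly the saving the corpus itself could have made.</p>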
<h2 id="dashboard">Dashboard</h2>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWpilOoLPSxgTnxeWyCXx39ef-LwSoQ-rkl7dowe8SLIGcAVD0gcNi8Z0DxSg6hU7qDJJmI8EaLK0atYZF3UTRQsTQIGh0G9MULOC48OTe3g2HOqzh9RvouDTSlqvUiEiMAhE_4K4KPgEnchi6DvdwobW8tJKIMOUnwlxv5Dk4GcJCIsqkGMyM/s1788/Screenshot%202024-02-20%20at%2011.58.57.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" height="400" data-original-height="1788" data-original-width="1534" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWpilOoLPSxgTnxeWyCXx39ef-LwSoQ-rkl7dowe8SLIGcAVD0gcNi8Z0DxSg6hU7qDJJmI8EaLK0atYZF3UTRQsTQIGh0G9MULOC48OTe3g2HOqzh9RvouDTSlqvUiEiMAhE_4K4KPgEnchi6DvdwobW8tJKIMOUnwlxv5Dk4GcJCIsqkGMyM/s400/Screenshot%202024-02-20%20at%2011.58.57.png"/></a></div>
<p>DataCite provides a <a href="http://corpus.datacite.org/dashboard">dashboard</a> to summarise key features of the corpus. There are a couple of aspects of the dashboard that I find frustrating.</p>
<p>Firstly, the “citation counts by subject” chart is misleading. A quick glance suggests that law and sociology are the subjects that most actively cite data. This would be surprising, especially given that much of the data generated by CZI comes from PubMed Central. Only 50,000 citations out of 10 million are from articles with subject tags, so this chart is showing results for approximately 0.5% of the corpus. The chart includes the caveat “The visualization includes the top 20 subjects where metadata is available.” but omits to tell us that as a result the chart is irrelevant for >99% of the data.</p>
<p>The dashboard is interesting in what it says about the stakeholders of this project. We see counts of citations broken down by source (CZI or DataCite) and by publisher, but not by repository. This suggests that repositories are second class citizens. Surely they deserve a panel on the dashboard? I suspect researchers are going to be more interested in what kinds of data are being cited than in which academic publishers are in the corpus. For instance, 3.75 million (37.5%) citations are to sequences in GenBank, 1.7 million (17%) are to the <a href="https://www.rcsb.org">Protein Data Bank (PDB)</a>, and 0.89 million (8.9%) are to SNPs.</p>
<h2 id="chan-zuckerberg-initiative-and-ai">Chan Zuckerberg Initiative and AI</h2>
<p>The corpus is a collaboration between DataCite and the Chan Zuckerberg Initiative (CZI), and CZI are responsible for the bulk of the data. Unfortunately there is no description of how those citations were extracted from the source papers. Perhaps CZI used something like SciBERT, which they employed in earlier work to extract citations to scientific software (<a href="https://arxiv.org/abs/2209.00693">https://arxiv.org/abs/2209.00693</a>)? We don’t know. One reason this matters is that there are lots of cases where the citations are incorrect, and if we are going to figure out why, we need to know how they were obtained. At present it is simply a black box.</p>
<p>These are just a few examples of incorrect citations:</p>
<ul>
<li>The mouse line <em>Prdm</em><sup><em>11tm1.1ahl</em></sup> is conflated with the PDB identifier <a href="https://www.rcsb.org/structure/1AHL">1ahl</a>, see <a href="https://hyp.is/c2Xras_KEe6zEGcm97yBRw/journals.plos.org/plosone/article?id=10.1371/journal.pone.0134503">https://hyp.is/c2Xras_KEe6zEGcm97yBRw/journals.plos.org/plosone/article?id=10.1371/journal.pone.0134503</a></li>
<li>A museum specimen CR00240699 is mistakenly interpreted as a GenBank accession number, see <a href="https://hyp.is/CGTJcM_kEe674TfyvGLC0A/zookeys.pensoft.net/article/21580/download/pdf/287887">https://hyp.is/CGTJcM_kEe674TfyvGLC0A/zookeys.pensoft.net/article/21580/download/pdf/287887</a></li>
<li>A grant number Y21026 is mistakenly interpreted as a GenBank accession number, see <a href="https://hyp.is/HpVXhs9PEe6D2UMxrIdqJw/bmjopen.bmj.com/content/12/9/e054887">https://hyp.is/HpVXhs9PEe6D2UMxrIdqJw/bmjopen.bmj.com/content/12/9/e054887</a></li>
<li>The time period 24 hours (24hr) is conflated with a PDB record 24hr that doesn’t exist <a href="https://hyp.is/dNfqZs9SEe6U2nMOKHb-Pw/journal.waocp.org/article_89819_8835738205ecaaad36eebfa826a17779.pdf">https://hyp.is/dNfqZs9SEe6U2nMOKHb-Pw/journal.waocp.org/article_89819_8835738205ecaaad36eebfa826a17779.pdf</a>. There are a lot of these, such as 17<sup>th</sup>, 2016, etc.</li>
</ul>
<p>These are just a few examples I came across while pottering around with the corpus. I’ve not done any large-scale analysis, but one ZooKeys article I came across <a href="https://doi.org/10.3897/zookeys.739.21580">https://doi.org/10.3897/zookeys.739.21580</a> cites 32 entities, only four of which are correct.</p>
<p>I get that text mining is hard, but I would expect AI to do better than what we could achieve by simply matching dumb regular expressions. For example, surely a tool that claims any measure of intelligence would be able to recognise that this sentence lists grant numbers, not a GenBank accession number?</p>
<blockquote>
<p>Funding This study was supported by Longhua Hospital Shanghai University of Traditional Chinese Medicine (grant number: Y21026), and Longhua Hospital Shanghai University of Traditional Chinese Medicine (YW.006.035)</p>
</blockquote>
<p>As a fallback, we could also check that a given identifier is valid. For example, there is no sequence with the accession number <a href="https://www.ncbi.nlm.nih.gov/nuccore?term=Y21026">Y21026</a>. The set of possible identifiers is finite (if large), so why didn’t the corpus check whether each identifier extracted actually exists?</p>
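<p>A first line of defence would be a syntactic check before any lookup. The sketch below uses a deliberately coarse regular expression covering older-style INSDC nucleotide accessions (one letter plus five digits, or two letters plus six digits); note that it happily accepts the grant number Y21026, which is shape-valid but does not exist, so an existence check against NCBI would still be needed:</p>

```python
import re

# Coarse syntactic filter for older-style GenBank/INSDC nucleotide
# accession numbers: one letter + five digits (e.g. U12345) or two
# letters + six digits (e.g. AF123456), with an optional ".1" version.
# This checks shape only; existence still requires an NCBI lookup.
ACCESSION_RE = re.compile(r"^[A-Z]\d{5}(\.\d+)?$|^[A-Z]{2}\d{6}(\.\d+)?$")

def looks_like_accession(code):
    return bool(ACCESSION_RE.match(code))
```

<p>Even this crude filter would have rejected strings like “24hr” before they ever reached the corpus.</p>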
<h2>Update: major errors found</h2>
<p>I've created a <a href="https://github.com/rdmpage/data-citation-corpus">GitHub repo</a> to keep track of the errors I'm finding.</p>
<h3>Protein Data Bank</h3>
<p>The <a href="https://www.wwpdb.org">Protein Data Bank</a> (PDB) is the second largest repository in the corpus, with 1,729,783 citations and 177,220 distinct PDB identifiers cited. These identifiers should match the pattern <code>/^[0-9][A-Za-z0-9]{3}$/</code>, that is, a digit 0-9 followed by three alphanumeric characters. However, 31,612 (18%) of the distinct identifiers do not. Examples include "//osf.io/6bvcq" and "//evs.nci.nih.gov/ftp1/CTCAE/CTCAE_4.03/Archive/CTCAE_4.0_2009-05-29_QuickReference_8.5x11.pdf". So the tools for finding PDB citations do not understand what a PDB identifier should look like.</p>
<p>Out of curiosity I downloaded all the existing PDB identifiers from <a href="https://files.wwpdb.org/pub/pdb/holdings/current_file_holdings.json.gz">https://files.wwpdb.org/pub/pdb/holdings/current_file_holdings.json.gz</a>, which gave me 216,225 distinct PDB identifiers. Comparing actual PDB identifiers with the ones included in the corpus I got 1,233,993 hits, which is 71% of the total in the corpus. Hence nearly half a million citations (a little under a third of the PDB citations) appear to be made up.</p>
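<p>The two checks just described, well-formedness and then existence, can be combined in a few lines. A sketch, where <code>real_ids</code> would come from the wwPDB holdings file:</p>

```python
import re

PDB_RE = re.compile(r"^[0-9][A-Za-z0-9]{3}$")  # the pattern given above

def audit_pdb_citations(cited_ids, real_ids):
    """Split cited PDB identifiers into three sets: syntactically
    invalid, well-formed but unknown, and confirmed to exist."""
    invalid = {i for i in cited_ids if not PDB_RE.match(i)}
    wellformed = set(cited_ids) - invalid
    real = {i.lower() for i in real_ids}  # PDB codes are case-insensitive
    confirmed = {i for i in wellformed if i.lower() in real}
    return invalid, wellformed - confirmed, confirmed
```

<p>Running something like this over the corpus is what produced the counts above; it is cheap enough that the corpus itself could have done it before release.</p>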
<h3>Individual articles</h3>
<h4>Taxonomic revision of Stigmatomma Roger (Hymenoptera: Formicidae) in the Malagasy region</h4>
<p>The paper <a href="https://doi.org/10.3897/BDJ.4.e8032">https://doi.org/10.3897/BDJ.4.e8032</a> is credited with citing 126 entities, including 108 sequences and 14 PDB records. None of this is true. The supposed PDB records are figure numbers, e.g. “Fig. 116d” becomes PDB 116d, and the sequence accession numbers are specimen codes or field numbers.</p>
<h3>Nucleotide sequences</h3>
<p>Sequence data is the single largest data type cited in the corpus, with 3.8 million citations. I ran a sample of the first 1,000 sequence accession numbers in the corpus against GenBank, and in 486 cases GenBank didn't recognise the accession number as valid. So potentially half the sequence citations are wrong.</p>
<h2 id="summary">Summary</h2>
<p>I think the Data Citation Corpus is potentially a great resource, but if it is going to be “[a] trusted central aggregate of all data citations” then I think there are a few things it needs to do:</p>
<ul>
<li>Make the data more easily accessible so that people can scrutinise it without having to jump through hoops</li>
<li>Tell us how the Chan Zuckerberg Initiative did the entity matching</li>
<li>Improve the entity matching</li>
<li>Add a quality control step that validates extracted identifiers</li>
<li>Expand the dashboard to give users a better sense of what data is being cited</li>
</ul>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>
<h1>It's 2023 - why are we still not sharing phylogenies?</h1>
<div style="padding:1em;border:1px solid #DDD;border-radius:0.2em;">How to cite: <strong>Page, R. (2023). It’s 2023 - why are we still not sharing phylogenies? <a href="https://doi.org/10.59350/n681n-syx67">https://doi.org/10.59350/n681n-syx67</a></strong></div>
<p>A quick note to support a recent Twitter thread <a href="https://twitter.com/rdmpage/status/1729816558866718796?s=61&t=nM4XCRsGtE7RLYW3MyIpMA">https://twitter.com/rdmpage/status/1729816558866718796?s=61&t=nM4XCRsGtE7RLYW3MyIpMA</a></p>
<p>The article “Diversification of flowering plants in space and time” by Dimitrov et al. describes a genus-level phylogeny for 14,244 flowering plant genera. This is a major achievement, and yet neither the tree nor the data supporting that tree are readily available. There is lots of supplementary information (as PDF files), but no machine readable tree or alignment data.</p>
<blockquote>
<p>Dimitrov, D., Xu, X., Su, X. <em>et al.</em> Diversification of flowering plants in space and time. <em>Nat Commun</em> <strong>14</strong>, 7609 (2023). <a href="https://doi.org/10.1038/s41467-023-43396-8">https://doi.org/10.1038/s41467-023-43396-8</a></p>
</blockquote>
<p>What we have is a link to a <a href="https://en.geodata.pku.edu.cn/index.php?c=content&a=list&catid=200">web site</a> which in turn has a link to a <a href="https://geodata.pku.edu.cn/OneZoomSeedPlants/seed_plants.htm">OneZoom visualisation</a>. If you look at the source code for the web site you can see the phylogeny in Newick format as a Javascript file.</p>
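<p>Getting the tree out means scraping it from the Javascript. A heuristic sketch, assuming the Newick string is embedded as a quoted literal ending in a semicolon (this is not a JS parser, just a pattern match):</p>

```python
import re

def extract_newick(js_source):
    """Pull a Newick string out of Javascript source, assuming the
    tree is embedded as a quoted string literal ending in ';'."""
    m = re.search(r"['\"](\(.+?;)['\"]", js_source, re.DOTALL)
    return m.group(1) if m else None
```

<p>Fragile as it is, this is the only way to get a machine-readable tree from the site, which rather makes the point about how the data should have been shared.</p>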
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwn95hB-WqA10yK-CFU120lsc_SiDYHyd1PyqXzUxmWGC7-PdUykllNS1Y3bhEAJzIzXW73nEHhyhDl_wcHekAeUEN9rZIRQl9eoNX91gDzi1aKnORzvIS9pf4m9aElI4lxZ8gNtpbxGyzqm1fYVj7Gh4VrxsoIo9Z1Ii2WarmHgUgrEzSQPzb/s1072/Screenshot%202023-11-29%20at%2011.00.47.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="400" data-original-height="706" data-original-width="1072" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwn95hB-WqA10yK-CFU120lsc_SiDYHyd1PyqXzUxmWGC7-PdUykllNS1Y3bhEAJzIzXW73nEHhyhDl_wcHekAeUEN9rZIRQl9eoNX91gDzi1aKnORzvIS9pf4m9aElI4lxZ8gNtpbxGyzqm1fYVj7Gh4VrxsoIo9Z1Ii2WarmHgUgrEzSQPzb/s400/Screenshot%202023-11-29%20at%2011.00.47.png"/></a></div>
<p>This is a far from ideal way to share data. Readers can’t easily get the tree, explore it, evaluate it, or use it in their own analyses. I grabbed the tree and put it online as a GitHub <a href="https://gist.github.com/rdmpage/dd9ffd3f67b80b1d560cc9448867d92a">GIST</a>. Once you have the tree you can do things such as trying a different tree viewer, for example <a href="https://phylocloud.cgmlab.org/tree_page/65670f1aac7dc72a041052c9/">PhyloCloud</a>.</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLHyspB-1kD3CoCkNUAM470pEF-W3g7XEi_8yrVe_rHnqbLrxk8RcymK6QSbjzN4PkFGzyNv4AxQYY6nHlWRH1cfu4ztgsICTrsUsfk6ZPp1Bssxla_JxkT9oKokScdARUX42kyCsIqRP5K1k1rQGygi9vbvn5BJ-4ab_XR_Cvl9Qi2qGLqAYi/s1341/Screenshot%202023-11-29%20at%2011.03.37.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="400" data-original-height="1180" data-original-width="1341" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLHyspB-1kD3CoCkNUAM470pEF-W3g7XEi_8yrVe_rHnqbLrxk8RcymK6QSbjzN4PkFGzyNv4AxQYY6nHlWRH1cfu4ztgsICTrsUsfk6ZPp1Bssxla_JxkT9oKokScdARUX42kyCsIqRP5K1k1rQGygi9vbvn5BJ-4ab_XR_Cvl9Qi2qGLqAYi/s400/Screenshot%202023-11-29%20at%2011.03.37.png"/></a></div>
<p>That is a start, but it’s clearly not ideal. Why didn’t the authors put the tree (and the data) into a proper repository, such as <a href="https://zenodo.org">Zenodo</a>, where it would be persistent and citable, and also linked to the authors’ ORCID profiles? That way everybody wins: readers get a tree to explore, and the authors gain an additional citable output.</p>
<p>The state of sharing of phylogenetic data is dire, not helped by the slow and painful demise of <a href="https://treebase.org/">TreeBASE</a>. Sharing machine readable trees and datasets still does not seem to be the norm in phylogenetics.</p>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>
<h1>Where are the plant type specimens? Mapping JSTOR Global Plants to GBIF</h1>
<div style="padding:1em;border:1px solid #DDD;border-radius:0.2em;">How to cite: <strong>Page, R. (2023). Where are the plant type specimens? Mapping JSTOR Global Plants to GBIF. <a href="https://doi.org/10.59350/m59qn-22v52">https://doi.org/10.59350/m59qn-22v52</a></strong></div>
<p>This blog post documents my attempts to create links between two major resources for plant taxonomy: JSTOR’s <a href="https://plants.jstor.org">Global Plants</a> and <a href="https://www.gbif.org">GBIF</a>, specifically between type specimens in JSTOR and the corresponding occurrence in GBIF. The TL;DR is that I have tried to map 1,354,861 records for type specimens from JSTOR to the equivalent record in GBIF, and managed to find 903,945 (67%) matches.</p>
<h2 id="why-do-this">Why do this?</h2>
<p>Why do this? Partly because a collaborator asked me, but I’ve long been interested in JSTOR’s Global Plants. This was a massive project to digitise plant type specimens all around the world, generating millions of images of herbarium sheets. It also resulted in a standardised way to refer to a specimen, namely its barcode, which comprises the herbarium code and a number (typically padded to eight digits). These barcodes are converted into JSTOR URLs, so that E00279162 becomes <a href="https://plants.jstor.org/stable/10.5555/al.ap.specimen.e00279162">https://plants.jstor.org/stable/10.5555/al.ap.specimen.e00279162</a>. These same barcodes have become the basis of efforts to create stable identifiers for plant specimens, for example <a href="https://data.rbge.org.uk/herb/E00279162">https://data.rbge.org.uk/herb/E00279162</a>.</p>
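<p>The barcode-to-URL convention can be captured in a few lines. A sketch, using the eight-digit zero-padding described above (some herbaria vary the width):</p>

```python
def jstor_url(herbarium_code, number, width=8):
    """Build a JSTOR Global Plants URL from a herbarium code and a
    specimen number, zero-padding the number to eight digits as in
    barcodes like E00279162."""
    barcode = f"{herbarium_code}{int(number):0{width}d}".lower()
    return f"https://plants.jstor.org/stable/10.5555/al.ap.specimen.{barcode}"
```

<p>The same barcode also slots into institutional stable identifiers such as the RBGE URL shown above, which is what makes it so useful as a linking key.</p>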
<p>JSTOR created an elegant interface to these specimens, complete with links to literature on JSTOR, BHL, and links to taxon pages on GBIF and elsewhere. It also added the ability to comment on individual specimens using <a href="https://disqus.com">Disqus</a>.</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaJx_5psYTAPfRLaomxACMxZ5Ant1peCHJixf0PeYP82k2uw9vOq4P2Q-HzSakV8q5aSmfqs0X_J_OJmnnhttknlwVYeEguQaxhk4fWi0lKvh_KzugoZWd_MjHmXcImkuMVbJPxXmwln6-Ah7iy4elXNGhfhmofrZbHrgcc3VC7E7lrVd05Utw/s1032/Screenshot%202023-10-26%20at%2013.30.11.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="400" data-original-height="666" data-original-width="1032" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaJx_5psYTAPfRLaomxACMxZ5Ant1peCHJixf0PeYP82k2uw9vOq4P2Q-HzSakV8q5aSmfqs0X_J_OJmnnhttknlwVYeEguQaxhk4fWi0lKvh_KzugoZWd_MjHmXcImkuMVbJPxXmwln6-Ah7iy4elXNGhfhmofrZbHrgcc3VC7E7lrVd05Utw/s400/Screenshot%202023-10-26%20at%2013.30.11.png"/></a></div>
<p>However, JSTOR Global Plants is not open. If you click on a thumbnail image of a herbarium sheet you hit a paywall.</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJg1Ek6wzEKStEwOAgm3wgI8sC49qq__akV0290AXG5yo1051G1SyR0wX7Psfba2E8ulEK4qixgvefRfXCxqZt5xilCbqgemLQJqwYsBPUQRhUxqWHaL2dtSO2FIplcujmGFXJMg0GwCEJ4qn5HqUHq93rOBvlVnLX6tJ9iOlwrPf8aeQ64bko/s1041/Screenshot%202023-10-26%20at%2013.30.29.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="400" data-original-height="701" data-original-width="1041" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJg1Ek6wzEKStEwOAgm3wgI8sC49qq__akV0290AXG5yo1051G1SyR0wX7Psfba2E8ulEK4qixgvefRfXCxqZt5xilCbqgemLQJqwYsBPUQRhUxqWHaL2dtSO2FIplcujmGFXJMg0GwCEJ4qn5HqUHq93rOBvlVnLX6tJ9iOlwrPf8aeQ64bko/s400/Screenshot%202023-10-26%20at%2013.30.29.png"/></a></div>
<p>In contrast data in GBIF is open. The table below is a simplified comparison of JSTOR and GBIF.</p>
<table>
<thead>
<tr>
<th>Feature</th>
<th>JSTOR</th>
<th>GBIF</th>
</tr>
</thead>
<tbody>
<tr>
<td>Open or paywall</td>
<td>Paywall</td>
<td>Open</td>
</tr>
<tr>
<td>Consistent identifier</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td>Images</td>
<td>All specimens</td>
<td>Some specimens</td>
</tr>
<tr>
<td>Types linked to original name</td>
<td>Yes</td>
<td>Sometimes</td>
</tr>
<tr>
<td>Community annotation</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td>Can download the data</td>
<td>No</td>
<td>Yes</td>
</tr>
<tr>
<td>API</td>
<td>No</td>
<td>Yes</td>
</tr>
</tbody>
</table><p>JSTOR offers a consistent identifier (the barcode), an image, has the type linked to the original name, and community annotation. But there is a paywall, and no way to download data. GBIF is open, enables both bulk download and API access, but often lacks images, and as we shall see below, the identifiers for specimens are a hot mess.</p>
<p>The “Types linked to original name” feature concerns whether the type specimen is connected to the appropriate name. A type is (usually) the type specimen for a single taxonomic name. For example, E00279162 is the type for <em>Achasma subterraneum</em> Holttum. This name is now regarded as a synonym of <em>Etlingera subterranea</em> (Holttum) R. M. Sm. following the transfer to the genus <em>Etlingera</em>. But E00279162 is not a type for the name <em>Etlingera subterranea</em>. JSTOR makes this clear by stating that the type is stored under <em>Etlingera subterranea</em> but is the type for <em>Achasma subterraneum</em>. However, this information does not make it to GBIF, which tells us that E00279162 is a type for <a href="https://www.gbif.org/species/2760154"><em>Etlingera subterranea</em></a> and that it knows of no type specimens for <a href="https://www.gbif.org/species/2760155"><em>Achasma subterraneum</em></a>. Hence querying GBIF for type specimens is potentially fraught with error.</p>
<p>Hence JSTOR often has cleaner and more accurate data, but it is behind a paywall. So I set out to get a list of all the type specimens that JSTOR has, and to try to match those to GBIF. This would give me a sense of how much of the content behind JSTOR’s paywall is freely available in GBIF, as well as how much content JSTOR has that is absent from GBIF. I also wanted to use JSTOR’s reference to the original plant name to get around GBIF’s tendency to link types to the wrong name.</p>
<h2 id="challenges">Challenges</h2>
<p>Mapping JSTOR barcodes to records in GBIF proved challenging. In an ideal world specimens would have a single identifier that everyone would use when citing or otherwise referring to that specimen. Of course this is not the case. There are all manner of identifiers, ranging from barcodes, collector names and numbers, local database keys (integers, UUIDs, and anything in between). Some identifiers include version codes. All of this greatly complicates linking barcodes to GBIF records. I made extensive use of my <a href="https://material-examined.herokuapp.com">Material examined</a> tool that attempts to translate specimen codes into GBIF records. Under the hood this means lots of regular expressions, and I spent a lot of time adding code to handle all the different ways herbaria manage to mangle barcodes.</p>
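<p>As an illustration of the kind of regular-expression cleanup involved, here is a sketch that canonicalises some common barcode variants (stray spaces, missing zero-padding, version suffixes); real herbarium data needs many more rules than this:</p>

```python
import re

def normalise_barcode(raw, width=8):
    """Canonicalise herbarium barcode variants such as 'e 279162' or
    'E00279162.1' to the padded form 'E00279162'. Illustrative only:
    each herbarium mangles barcodes in its own way."""
    m = re.match(r"^\s*([A-Za-z]+)[\s\-]*0*(\d+)(?:\.\d+)?\s*$", raw)
    if not m:
        return None
    code, number = m.group(1).upper(), m.group(2)
    return f"{code}{number.zfill(width)}"
```

<p>Normalising both sides to a single canonical form is what makes the JSTOR-to-GBIF join possible at all.</p>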
<p>In some cases JSTOR barcodes are absent from the specimen information in the GBIF occurrence record itself but are hidden in metadata for the image (such as the URL to the image). My “Material examined” tool uses the GBIF API, and that doesn’t enable searches for parts of image URLs. Hence for some herbaria I had to download the archive, extract media URLs and look for barcodes. In the process I encountered a subtle bug in Safari that truncated downloads, see <a href="https://discourse.gbif.org/t/downloads-failing-to-include-all-files-in-the-archive/4159">Downloads failing to include all files in the archive</a>.</p>
<p>Some herbaria have data in both JSTOR and GBIF, but no identifiers in common (other than collector names and numbers, which would require approximate string matching). But in some cases the herbaria have their own web sites which mention the JSTOR barcodes, as well as the identifiers those herbaria do share with GBIF. In these cases I would attempt to scrape the herbaria web sites, extract the barcode and original identifier, then find the original identifier in GBIF.</p>
<p>Another observation is that in some cases the imagery in JSTOR is not the same as in GBIF. For example <a href="https://plants.jstor.org/stable/10.5555/al.ap.specimen.lisc002383">LISC002383</a> and <a href="https://www.gbif.org/occurrence/813346859">813346859</a> are the same specimen, but the images are different. Why are the images provided to JSTOR not being provided to GBIF?</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_rl3EqL5DF98LVrbGz4qMDFIM7nCr5EA4QUgzjO2cyRjcUlqvopB3RPSJ8J5Xw6XCXirAc8R8nRUZM_WtcY5xbtAX1Op7ZI3Va7JH2oH90cMniQHapcnMWPBiMb7QqAEIwoyZGUiJZ78g-Akei1c6LvN0Sa2cfahoNtz_EiF0L5s6K_bJyIkp/s200/lisc002383.jpg" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" height="320" data-original-height="200" data-original-width="152" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_rl3EqL5DF98LVrbGz4qMDFIM7nCr5EA4QUgzjO2cyRjcUlqvopB3RPSJ8J5Xw6XCXirAc8R8nRUZM_WtcY5xbtAX1Op7ZI3Va7JH2oH90cMniQHapcnMWPBiMb7QqAEIwoyZGUiJZ78g-Akei1c6LvN0Sa2cfahoNtz_EiF0L5s6K_bJyIkp/s320/lisc002383.jpg"/></a></div><div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhx2z-YmO1MqVdO0qvMaVbpd4fka-jXcl6g_r8mN6QuLgje8KCVuWJWh2gw_fKE9BJ1kFZ9tewx5fbdHkh4Jm1rd8agdl1JFMYAQ9Ku-3fu3OW8emjKm39CV2KTASNNZZHdm9fhrlbAB9S75CxxYmgGElZXVD4M0q8OhJ46elIPFaRYj2SBSznL/s200/b6f10ba1bb2460a11fa9b0d36f1ce0a6.jpeg" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" height="320" data-original-height="200" data-original-width="133" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhx2z-YmO1MqVdO0qvMaVbpd4fka-jXcl6g_r8mN6QuLgje8KCVuWJWh2gw_fKE9BJ1kFZ9tewx5fbdHkh4Jm1rd8agdl1JFMYAQ9Ku-3fu3OW8emjKm39CV2KTASNNZZHdm9fhrlbAB9S75CxxYmgGElZXVD4M0q8OhJ46elIPFaRYj2SBSznL/s320/b6f10ba1bb2460a11fa9b0d36f1ce0a6.jpeg"/></a></div>
<p>In the process of making this mapping it became clear that there are herbaria that aren’t in GBIF, for example Singapore (SING) is not in GBIF but instead is hosted at Oxford University (!) at <a href="https://herbaria.plants.ox.ac.uk/bol/sing">https://herbaria.plants.ox.ac.uk/bol/sing</a>. There seem to be a number of herbaria that have content in JSTOR but not GBIF, hence GBIF has gaps in its coverage of type specimens.</p>
<p>Interestingly, JSTOR rarely seems to be a destination for links. An exception is the Paris museum, where for example specimen <a href="https://science.mnhn.fr/institution/um/collection/mpu/item/mpu015018">MPU015018</a> has a link to the JSTOR record for the same specimen, <a href="http://plants.jstor.org/specimen/MPU015018">MPU015018</a>.</p>
<h2 id="matching-taxonomic-names">Matching taxonomic names</h2>
<p>As a check on matching JSTOR to GBIF I would also check that the taxonomic names associated with the two records are the same. The challenge here is that the names may have changed. Ideally both JSTOR and GBIF would have either a history of name changes, or at least the original name the specimen was associated with (i.e., the name for which the specimen is the type). And of course, this isn’t the case. So I relied on a series of name comparisons, such as “are the names the same?”, “if the names are different, are the specific epithets the same?”, and “if the specific epithets are different, are the generic names the same?”. Because the spelling of species names can change depending on the gender of the genus, I also used some stemming rules to catch names that were the same even if their endings were different.</p>
<p>This approach will still miss some matches, such as hybrid names, and cases where a specimen is stored under a completely different name (e.g., the original name is a heterotypic synonym of a different name).</p>
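<p>The comparison cascade can be sketched as follows; the suffix list is a crude stand-in for proper Latin gender stemming, so this is illustrative only:</p>

```python
def same_name(name_a, name_b):
    """Cascade of name comparisons: full name, then specific epithet
    (with crude gender stemming so e.g. 'subterraneum' and
    'subterranea' compare equal), then genus."""
    def stem(word):
        # strip common Latin gender endings; a rough heuristic
        for suffix in ("um", "us", "a", "is", "e"):
            if word.endswith(suffix):
                return word[: -len(suffix)]
        return word

    a, b = name_a.lower().split(), name_b.lower().split()
    if a == b:
        return "identical"
    if len(a) > 1 and len(b) > 1 and stem(a[1]) == stem(b[1]):
        return "same epithet"
    if a[0] == b[0]:
        return "same genus"
    return "different"
```

<p>So <em>Achasma subterraneum</em> and <em>Etlingera subterranea</em> come out as “same epithet”, which is exactly the kind of match the transfer to <em>Etlingera</em> requires.</p>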
<h2 id="mapping">Mapping</h2>
<p>The mapping made so far is available on GitHub <a href="https://github.com/rdmpage/jstor-plant-specimens">https://github.com/rdmpage/jstor-plant-specimens</a> and Zenodo <a href="https://doi.org/10.5281/zenodo.10044359">https://doi.org/10.5281/zenodo.10044359</a>.</p>
<p>At the time of writing I have retrieved 1,354,861 records for type specimens from JSTOR, of which 903,945 (67%) have been matched to GBIF.</p>
<p>This has been a sobering lesson in just how far we are from being able to treat specimens as citable things; we simply don’t have decent identifiers for them. JSTOR made a lot of progress, but that has been hampered by being behind a paywall, and by the fact that many of these identifiers are lost or mangled by the time they make their way into GBIF, which is arguably where most people get information on specimens.</p>
<p>There’s an argument that it would be great to get JSTOR Global Plants into GBIF. It would certainly add a lot of extra images, and also provide a presence for a number of smaller herbaria that aren’t in GBIF. I think there’s also a case to be made for having a <a href="https://www.gbif.org/hosted-portals">GBIF hosted portal</a> for plant type specimens, to help make these valuable objects more visible and discoverable.</p>
<p>Below is a bar chart of the top 50 herbaria ranked by number of type specimens in JSTOR, showing the numbers of specimens mapped to GBIF (red) and those not found (blue).</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6JvmpGTkkKnfPAaaPaLwfyffVImyk_pN6QYeDHFGKMlC7RgWnxoK6oBtMkvOQF0DxO5f9eoUph3nlKwT0_BoJ-LG089ChLfI_1xWIoBgYFYR0CIAalTgArQLH1rBEecbnWwkDbpv4QKS-5vQL3dRRLqKmcL2TEZNga7yM-cPwdRXSC68MrKGY/s948/JSTOR%20type%20specimens%20in%20GBIF.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" height="400" data-original-height="948" data-original-width="754" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6JvmpGTkkKnfPAaaPaLwfyffVImyk_pN6QYeDHFGKMlC7RgWnxoK6oBtMkvOQF0DxO5f9eoUph3nlKwT0_BoJ-LG089ChLfI_1xWIoBgYFYR0CIAalTgArQLH1rBEecbnWwkDbpv4QKS-5vQL3dRRLqKmcL2TEZNga7yM-cPwdRXSC68MrKGY/s400/JSTOR%20type%20specimens%20in%20GBIF.png"/></a></div>
<h2 id="reading">Reading</h2>
<ul>
<li>
<p>Boyle, B., Hopkins, N., Lu, Z. et al. The taxonomic name resolution service: an online tool for automated standardization of plant names. BMC Bioinformatics 14, 16 (2013). <a href="https://doi.org/10.1186/1471-2105-14-16">https://doi.org/10.1186/1471-2105-14-16</a></p>
</li>
<li>
<p><a href="https://cetaf.org/best-practices/cetaf-stable-identifiers-csi-2/">CETAF Stable Identifiers (CSI)</a></p>
</li>
<li>
<p><a href="http://herbal.rbge.info/md.php?q=implementers">CETAF Specimen URI Tester</a></p>
</li>
<li>
<p>Holttum, R. E. (1950). The Zingiberaceae of the Malay Peninsula. Gardens’ Bulletin, Singapore, 13(1), 1-249. <a href="https://biostor.org/reference/163926">https://biostor.org/reference/163926</a></p>
</li>
<li>
<p>Hyam, R.D., Drinkwater, R.E. & Harris, D.J. Stable citations for herbarium specimens on the internet: an illustration from a taxonomic revision of Duboscia (Malvaceae) Phytotaxa 73: 17–30 (2012). <a href="https://doi.org/10.11646/phytotaxa.73.1.4">https://doi.org/10.11646/phytotaxa.73.1.4</a></p>
</li>
<li>
<p>Rees T (2014) Taxamatch, an Algorithm for Near (‘Fuzzy’) Matching of Scientific Names in Taxonomic Databases. PLoS ONE 9(9): e107510. <a href="https://doi.org/10.1371/journal.pone.0107510">https://doi.org/10.1371/journal.pone.0107510</a></p>
</li>
<li>
<p>Ryan D (2018) Global Plants: A Model of International Collaboration. Biodiversity Information Science and Standards 2: e28233. <a href="https://doi.org/10.3897/biss.2.28233">https://doi.org/10.3897/biss.2.28233</a></p>
</li>
<li>
<p>Ryan, D. (2013), The Global Plants Initiative celebrates its achievements and plans for the future. Taxon, 62: 417-418. <a href="https://doi.org/10.12705/622.26">https://doi.org/10.12705/622.26</a></p>
</li>
<li>
<p>(2016), Global Plants Sustainability: The Past, The Present and The Future. Taxon, 65: 1465-1466. <a href="https://doi.org/10.12705/656.38">https://doi.org/10.12705/656.38</a></p>
</li>
<li>
<p>Smith, G.F. and Figueiredo, E. (2013), Type specimens online: What is available, what is not, and how to proceed; Reflections based on an analysis of the images of type specimens of southern African Polygala (Polygalaceae) accessible on the worldwide web. Taxon, 62: 801-806. <a href="https://doi.org/10.12705/624.5">https://doi.org/10.12705/624.5</a></p>
</li>
<li>
<p>Smith, R. M. (1986). New combinations in Etlingera Giseke (Zingiberaceae). Notes from the Royal Botanic Garden Edinburgh, 43(2), 243-254.</p>
</li>
<li>
<p>Anna Svensson; Global Plants and Digital Letters: Epistemological Implications of Digitising the Directors’ Correspondence at the Royal Botanic Gardens, Kew. Environmental Humanities 1 May 2015; 6 (1): 73–102. doi: <a href="https://doi.org/10.1215/22011919-3615907">https://doi.org/10.1215/22011919-3615907</a></p>
</li>
</ul>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>
Roderic Pagehttp://www.blogger.com/profile/00269598293846172649noreply@blogger.comtag:blogger.com,1999:blog-16081779.post-4188889662973589112023-08-31T13:33:00.004+01:002024-02-07T10:32:07.677+00:00Document layout analysis<div style="padding:1em;border:1px solid #DDD;border-radius:0.2em;">How to cite: <strong>Page, R. (2023). Document layout analysis. <a href="https://doi.org/10.59350/z574z-dcw92">https://doi.org/10.59350/z574z-dcw92</a></strong></div>
<p>Some notes to self on document layout analysis.</p>
<p>I’m revisiting the problem of taking a PDF or a scanned document and determining its structure (for example, where is the title, abstract, bibliography, where are the figures and their captions, etc.). There are lots of papers on this topic, and lots of tools. I want something that I can use to process both born-digital PDFs and scanned documents, such as the <a href="https://www.abbyy.com">ABBYY</a>, <a href="https://en.wikipedia.org/wiki/DjVu">DjVu</a> and <a href="https://en.wikipedia.org/wiki/HOCR">hOCR</a> files on the Internet Archive. PDFs remain the dominant vehicle for publishing taxonomic papers, and aren’t going away any time soon (see Pettifer et al. for a nuanced discussion of PDFs).</p>
<p>There are at least three approaches to document layout analysis.</p>
<h2 id="rule-based">Rule-based</h2>
<p>The simplest approach is to come up with rules, such as “if the text is large and it’s on the first page, it’s the title of the article”. Examples of more sophisticated rules are given in Klampfl et al., Ramakrishnan et al., and Lin. Rule-based methods can get you a long way, as shown by projects such as <a href="https://plazi.org">Plazi</a>. But there are always exceptions to rules, so the rules need constant tweaking. At some point it makes sense to consider probabilistic methods that allow for uncertainty, and which can also “learn”.</p>
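<p>To make the rule-based idea concrete, here is a toy classifier in the spirit of the “big text on the first page is the title” rule. This is just a sketch: the block fields (<code>text</code>, <code>page</code>, <code>font_size</code>) are illustrative, not taken from any particular tool.</p>

```python
# Toy rule-based block classifier (illustrative field names, not any real tool's schema).
def classify_block(block, body_font_size=10):
    text = block["text"].strip().lower()
    # Rule: large text on the first page is probably the title.
    if block["page"] == 1 and block["font_size"] >= 1.5 * body_font_size:
        return "Title"
    # Rule: a block starting with "abstract" is probably the abstract.
    if text.startswith("abstract"):
        return "Abstract"
    # Rule: common section headings.
    if text in ("references", "bibliography", "literature cited"):
        return "Section"
    return "Paragraph"
```

<p>Rules like these are easy to write and easy to read, but every new journal layout tends to demand yet another special case.</p>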
<h2 id="large-language-models-llms">Large language models (LLMs)</h2>
<p>At the other extreme are Large language models (LLMs), which have got a lot of publicity lately. There are a number of tools that use LLMs to help extract information from documents, such as LayoutLM (Xu et al.), <a href="https://layout-parser.github.io">Layout Parser</a>, and VILA (Shen et al.). These approaches encode information about a document (in some cases including the (<em>x</em>,<em>y</em>) coordinates of individual words on a page) and try to infer which category each word (or block of text) belongs to. These methods are typically coded in Python, and come with various tools to display regions on pages. I’ve had variable success getting these tools to work (I am new to Python, and am also working on a recent Mac, which is not the most widely used hardware for machine learning). I have got other ML tools to work, such as an Inception-based model to classify images (see <a href="https://iphylo.blogspot.com/2023/07/adventures-in-machine-learning.html">Adventures in machine learning: iNaturalist, DNA barcodes, and Lepidoptera</a>), but I’ve not succeeded in training these models. There are obscure Python error messages, some involving Hugging Face, and eventually my patience wore out.</p>
<p>Another aspect of these methods is that they often package everything together, such that they take a PDF, use OCR or ML methods such as <a href="https://ai.meta.com/tools/detectron/">Detectron</a> to locate blocks, then encode the results and feed them to a model. This is great, but I don’t necessarily want the whole package; I want just some parts of it. Nor does the prospect of lengthy training appeal (even if I could get it to work properly).</p>
<p>The approach that appealed the most is VILA, which doesn’t use (<em>x</em>,<em>y</em>) coordinates directly but instead encodes information about “blocks” into text extracted from a PDF, then uses an LLM to infer document structure. There is a simple demo at <a href="https://huggingface.co/allenai/vila-scibert-cased-s2vl">Hugging Face</a>. After some experimentation with the code, I’ve ended up using the way VILA represents a document (a JSON file with a series of pages, each with lists of words, their positions, and information on lines, blocks, etc.) as the format for my experiments. If nothing else this means that if I go back to trying to train these models I will have data already prepared in an appropriate format. I’ve also decided to follow VILA’s scheme for labelling words and blocks in a document:</p>
<ul>
<li>Title</li>
<li>Author</li>
<li>Abstract</li>
<li>Keywords</li>
<li>Section</li>
<li>Paragraph</li>
<li>List</li>
<li>Bibliography</li>
<li>Equation</li>
<li>Algorithm</li>
<li>Figure</li>
<li>Table</li>
<li>Caption</li>
<li>Header</li>
<li>Footer</li>
<li>Footnote</li>
</ul>
<p>I’ve tweaked this slightly by adding two additional tags from VILA’s <a href="https://docs.google.com/document/d/1frGmzYOHnVRWAwTOuuPfc3KVAwu-XKdkFSbpLfy78RI/edit?usp=sharing">Labeling Category Reference</a>: the “semantic” tags “Affiliation” and “Venue”. This helps separate information on author names (“Author”) from their affiliations, which can appear in very different positions from the authors’ names. “Venue” is useful for labelling things such as the banner at the top of an article where the publisher displays the name of the journal, etc.</p>
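<p>For what it’s worth, my working version of this page representation looks roughly like the following. The field names here are my own approximation, not necessarily VILA’s actual schema; the point is simply that words, their positions, and line/block membership travel together.</p>

```python
# Sketch of one page in a VILA-style JSON document. Field names are my own
# approximation of the format, not guaranteed to match VILA's schema.
page = {
    "words": ["Document", "layout", "analysis"],            # tokens in reading order
    "bbox": [[50, 40, 148, 60], [156, 40, 216, 60], [224, 40, 310, 60]],
    "line_ids": [0, 0, 0],      # which line each word belongs to
    "block_ids": [0, 0, 0],     # which block (e.g., paragraph) each word belongs to
    "labels": ["Title", "Title", "Title"],                  # per-word category
}

# Every per-word list must stay aligned.
assert len(page["words"]) == len(page["bbox"]) == len(page["labels"])
```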
<h2 id="conditional-random-fields">Conditional random fields</h2>
<p>In between masses of regular expressions and large language models are approaches such as <a href="https://en.wikipedia.org/wiki/Conditional_random_field">Conditional random fields</a> (CRFs), which I’ve used before to parse citations (see <a href="https://iphylo.blogspot.com/2021/07/citation-parsing-released.html">Citation parsing tool released</a>). Well known tools such as <a href="https://github.com/kermitt2/grobid">GROBID</a> use this approach.</p>
<p>CRFs are fast, and somewhat comprehensible. But they do require <a href="https://en.wikipedia.org/wiki/Feature_engineering">feature engineering</a>: that is, you need to come up with features of the data to help train the model (for the systematists among you, this is very like coming up with characters for a bunch of taxa). This is where you can reuse the rules developed in a rules-based approach, but instead of having the rules make decisions (e.g., “big text = Title”), you just add a rule that detects whether text is big or not, and the model, combined with training data, then figures out if and when big text means “Title”. So you end up spending time trying to figure out how to represent document structure, and what features help the model get the right answer. For example, methods such as Lin’s for detecting whether there are recurring elements in a document are a great source of features to help recognise headers and footers. CRFs also make it straightforward to include dependencies (the “conditional” in the name). For example, a bibliography in a paper can be recognised not just by a line having a year in it (e.g., “2020”), but by there being nearby lines that also have years in them. This helps us avoid labelling isolated lines with years as “Bibliography” when they are simply text in a paragraph that mentions a year.</p>
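<p>As a sketch of what this feature engineering can look like, here is a minimal per-line feature extractor (the feature names and line fields are illustrative). Note the “conditional” context: a line’s features include whether its neighbours also contain years, which is what lets the model tell a bibliography from a paragraph that happens to mention a year.</p>

```python
import re

YEAR = re.compile(r"\b(1[89]|20)\d{2}\b")   # matches years like 1986 or 2023

def line_features(lines, i, body_font_size=10):
    """Features for line i of a document; each line is a dict with text and font size."""
    line = lines[i]
    neighbours = [lines[j] for j in (i - 1, i + 1) if 0 <= j < len(lines)]
    return {
        "big_text": line["font_size"] >= 1.5 * body_font_size,
        "has_year": bool(YEAR.search(line["text"])),
        # Conditional context: bibliographies are runs of year-bearing lines.
        "neighbour_has_year": any(YEAR.search(n["text"]) for n in neighbours),
        "ends_with_period": line["text"].rstrip().endswith("."),
    }
```

<p>Feature dicts like these can then be handed to a CRF implementation (GROBID uses CRFs in a similar way); training on a handful of labelled documents takes seconds.</p>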
<p>Compared to LLMs this is a lot of work. In principle with an LLM you “just” take a lot of training data (e.g., text and location on a page) and let the model do the hard work of figuring out which bit of the document corresponds to which category (e.g., title, abstract, paragraph, bibliography). The underlying model has already been trained on (potentially) vast amounts of text (and sometimes also word coordinates). But on the plus side, training CRFs is very quick, and hence you can experiment with adding or removing features, adding training data, etc. For example, I started training with about ten documents, training takes seconds, and I’ve got serviceable results.</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi02en_43jWD45tsJ9bVqrJFcVG_NYr_obOmFGzT2PuW-3AbvwMbtt7y43TaGYqaQL4tURGVA0Bm_e6DvmN6kGiROtqFYj_Tjco0KeXnCwUr2ASqewCNjbunTa7nmMVp8jKPWk0B6BHLSbZrz85mzq6d78XNRgG1G3g9MupScfxBbO14B89kfMS/s855/Screenshot%202023-08-31%20at%2011.54.13.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" height="400" data-original-height="855" data-original-width="596" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi02en_43jWD45tsJ9bVqrJFcVG_NYr_obOmFGzT2PuW-3AbvwMbtt7y43TaGYqaQL4tURGVA0Bm_e6DvmN6kGiROtqFYj_Tjco0KeXnCwUr2ASqewCNjbunTa7nmMVp8jKPWk0B6BHLSbZrz85mzq6d78XNRgG1G3g9MupScfxBbO14B89kfMS/s400/Screenshot%202023-08-31%20at%2011.54.13.png"/></a></div>
<p>Lots of room for improvement, but there’s a constant feedback loop of seeing improvements, and thinking about how to tweak the features. It also encourages me to think about what went wrong.</p>
<h2 id="problems-with-pdf-parsing">Problems with PDF parsing</h2>
<p>To process PDFs, especially “born digital” PDFs I rely on <a href="https://github.com/rdmpage/pdf2xml">pdf2xml</a>, originally written by Hervé Déjean (Xerox Research Centre Europe). It works really well, but I’ve encountered a few issues. Some can be fixed by adding more fonts to my laptop (from <a href="http://www.xpdfreader.com/download.html">XpdfReader</a>), but others are more subtle.</p>
<p>The algorithm used to assign words to “blocks” (e.g., paragraphs) seems to struggle with superscripts (e.g., <sup>1</sup>), which often end up being treated as separate blocks. This breaks up lines of text, and also makes it harder to accurately label parts of the document such as “Author” or “Affiliation”.</p>
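<p>One possible post-processing fix (a sketch of my own, not something pdf2xml does) is to re-attach a block to a neighbouring line when it is much smaller than the body text and overlaps that line vertically, which is exactly how superscripts behave:</p>

```python
def is_superscript_of(block, line, body_font_size=10):
    """Heuristic: a much smaller block that vertically overlaps a line
    is probably a superscript split off from that line.

    Blocks and lines are dicts with a font size and y0/y1 bounding
    coordinates (illustrative field names).
    """
    small = block["font_size"] <= 0.75 * body_font_size
    overlap = min(block["y1"], line["y1"]) - max(block["y0"], line["y0"])
    return small and overlap > 0
```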
<p>Figures can also be problematic. Many are simply bitmaps embedded in a PDF and can be easily extracted, but sometimes the labelling on those bitmaps, or indeed big chunks of vector diagrams, is treated as text, so we end up with stray text blocks in odd positions. I need to spend a little time thinking about this as well. I also need to understand the “vec” format pdftoxml extracts from PDFs.</p>
<p>PDFs also have all sorts of quirks, such as publishers slapping cover pages on the front, which makes feature engineering hard (the biggest text might now not be the title but some cruft from the publisher). Sometimes there are clues in the PDF that it has been modified. For example, ResearchGate inserts a “rgid” tag in the PDF when it adds a cover page.</p>
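<p>The ResearchGate case at least is easy to detect. As a sketch, scanning the raw bytes of the file for the “rgid” tag is enough, without parsing the PDF at all:</p>

```python
def has_researchgate_cover(pdf_bytes: bytes) -> bool:
    """Crude check for a ResearchGate cover page: look for the "rgid" tag."""
    return b"rgid" in pdf_bytes

# Hypothetical usage:
# with open("paper.pdf", "rb") as f:
#     print(has_researchgate_cover(f.read()))
```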
<h2 id="yes-but-why">Yes but why?</h2>
<p>So, why am I doing this? Why battle with the much-maligned PDF format? It’s because a huge chunk of taxonomic and other information is locked up in PDFs, and I’d like a simpler, scalable way to extract some of that. Plazi is obviously the leader in this area in terms of the amount of information they have extracted, but their approach is labour-intensive. I want something that is essentially automatic, that can be trained to handle the idiosyncrasies of the taxonomic literature, and that can be applied to both born-digital PDFs and OCR from scans in the Biodiversity Heritage Library and elsewhere. Even if we could simply extract bibliographic information (to flesh out the citation graph) and the figures, that would be progress.</p>
<h2 id="references">References</h2>
<p>Déjean H, Meunier J-L (2006) A System for Converting PDF Documents into Structured XML Format. In: Bunke H, Spitz AL (eds) Document Analysis Systems VII. Springer, Berlin, Heidelberg, pp 129–140 <a href="https://doi.org/10.1007/11669487_12">https://doi.org/10.1007/11669487_12</a></p>
<p>Klampfl S, Granitzer M, Jack K, Kern R (2014) Unsupervised document structure analysis of digital scientific articles. Int J Digit Libr 14(3):83–99. <a href="https://doi.org/10.1007/s00799-014-0115-1">https://doi.org/10.1007/s00799-014-0115-1</a></p>
<p>Lin X (2003) Header and footer extraction by page association. In: Document Recognition and Retrieval X. SPIE, pp 164–171 <a href="https://doi.org/10.1117/12.472833">https://doi.org/10.1117/12.472833</a></p>
<p>Pettifer S, McDermott P, Marsh J, Thorne D, Villeger A, Attwood TK (2011) Ceci n’est pas un hamburger: modelling and representing the scholarly article. Learned Publishing 24(3):207–220. <a href="https://doi.org/10.1087/20110309">https://doi.org/10.1087/20110309</a></p>
<p>Ramakrishnan C, Patnia A, Hovy E, Burns GA (2012) Layout-aware text extraction from full-text PDF of scientific articles. Source Code for Biology and Medicine 7(1):7. <a href="https://doi.org/10.1186/1751-0473-7-7">https://doi.org/10.1186/1751-0473-7-7</a></p>
<p>Shen Z, Lo K, Wang LL, Kuehl B, Weld DS, Downey D (2022) VILA: Improving Structured Content Extraction from Scientific PDFs Using Visual Layout Groups. Transactions of the Association for Computational Linguistics 10:376–392. <a href="https://doi.org/10.1162/tacl_a_00466">https://doi.org/10.1162/tacl_a_00466</a></p>
<p>Xu Y, Li M, Cui L, Huang S, Wei F, Zhou M (2020) LayoutLM: Pre-training of Text and Layout for Document Image Understanding. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp 1192–1200</p>
Roderic Pagehttp://www.blogger.com/profile/00269598293846172649noreply@blogger.comtag:blogger.com,1999:blog-16081779.post-61520002558684270332023-08-03T16:42:00.004+01:002024-02-07T10:33:57.398+00:00The problem with GBIF's Phylogeny Explorer<div style="padding:1em;border:1px solid #DDD;border-radius:0.2em;">How to cite: <strong>Page, R. (2023). The problem with GBIF’s Phylogeny Explorer. <a href="https://doi.org/10.59350/v0bt3-zp114">https://doi.org/10.59350/v0bt3-zp114</a></strong></div>
<p>GBIF recently released the <a href="https://www.legumedata.org/phylogeny/explore/">Phylogeny Explorer</a>, using legumes as an example dataset. The goal is to enable users to “view occurrence data from the GBIF network aligned to legume phylogeny.” The screenshot below shows the legume phylogeny side-by-side with GBIF data.</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhftmvn7WXupSva1ZoslH4BKCZbyv5IbShT09dK1Ix11TwIBJSbhA-PafXFg2e2xAf3-FeUbYv4eUwZdpm54xS2Si-88fKqjaEec8hEIbPvO0WBacyR6MQ_-61ZwL1mWRdR2bofLyomz4SbDDzdNtZ5sSw3buTv_7c84jL6JpqkXMz1WR7wXEP1/s1040/Screenshot%202023-08-03%20at%2015.19.02.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="400" data-original-height="859" data-original-width="1040" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhftmvn7WXupSva1ZoslH4BKCZbyv5IbShT09dK1Ix11TwIBJSbhA-PafXFg2e2xAf3-FeUbYv4eUwZdpm54xS2Si-88fKqjaEec8hEIbPvO0WBacyR6MQ_-61ZwL1mWRdR2bofLyomz4SbDDzdNtZ5sSw3buTv_7c84jL6JpqkXMz1WR7wXEP1/s400/Screenshot%202023-08-03%20at%2015.19.02.png"/></a></div>
<p>Now, I’m all in favour of integrating phylogenies and occurrence data, and I have a lot of respect for the people behind this project (Morten Høfft and Thomas Stjernegaard Jeppesen), but I think this way of displaying a phylogeny has multiple problems. Indeed, it suffers from many of the classic “mistakes” people make when trying to view big trees.</p>
<h2 id="why-maps-work">Why maps work</h2>
<p>Tree visualisation is a challenging problem. I wrote a somewhat out-of-date review on this topic a decade ago, and Googling will find many papers on the topic. There is also the amazing <a href="https://treevis.net">treevis.net</a>.</p>
<p>I think the key issues can be seen once we compare the tree on the left with the map on the right. The map allows zooming in and out, and it does this equally in both the <em>x</em> and <em>y</em> dimensions. In other words, when you zoom in the map expands left to right and top to bottom. This makes sense because the map is a square. Obviously the Earth is not a square, but the projection used by web maps (such as Google Maps, OpenStreetMap, etc.) treats the world as one. Below is the world at zoom level 0, a 256 x 256 pixel square.</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqDLicNwFiBzHeH6NLOwFv7rMCqSFnKFPbcmQHR3Z7POedr6glY-1ZlHEIJqbrrLJuWSsxodSIkrZwoNg77WfA_8U2A7ICrCl84uGLy09SAV2KlReNexwFrMRNS2NghYetOODN2BXr-3ep0-ewMIab6yDPAfW1Ut_7j9g9Fwu2s9PSBXhfb7mW/s256/0.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="256" data-original-width="256" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqDLicNwFiBzHeH6NLOwFv7rMCqSFnKFPbcmQHR3Z7POedr6glY-1ZlHEIJqbrrLJuWSsxodSIkrZwoNg77WfA_8U2A7ICrCl84uGLy09SAV2KlReNexwFrMRNS2NghYetOODN2BXr-3ep0-ewMIab6yDPAfW1Ut_7j9g9Fwu2s9PSBXhfb7mW/s320/0.png"/></a></div>
<p>When you zoom in the number of tiles is doubled with each increase in zoom level, and you get a more and more detailed map. As you zoom in on a map typically you see labels appearing and disappearing. These labels are (a) always legible, and (b) they change with zoom level. Continent names appear before cities, but disappear once you’ve zoomed in to country level or below.</p>
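<p>The arithmetic behind this scheme is simple: zoom level <em>z</em> has 2<sup><em>z</em></sup> × 2<sup><em>z</em></sup> tiles, and a longitude and latitude map to a tile via the Web Mercator projection. A minimal version (these are the standard “slippy map” formulas, not anything specific to one map provider):</p>

```python
import math

def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) tile containing a point at a given zoom level."""
    n = 2 ** zoom                       # tiles per axis double with each zoom level
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y
```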
<p>To summarise, the map visualisation <strong>zooms appropriately</strong>, always has <strong>legible labels</strong>, and the <strong>level of detail</strong> and labelling changes with zoom level. None of this is true for the GBIF phylogeny viewer.</p>
<h2 id="the-phylogeny-problem">The phylogeny problem</h2>
<p>The screenshot below shows GBIF’s display of the legume tree such that the whole tree fits into the window. No labels are visible, and the tree structure is hard to see. There are no labels for major groups, so we have no obvious way to find our way around the tree.</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgKx_Yf5ZJf3cWMA-lm3-PwVZ3k_eCALXH4MYNCEHL3QZcs2FwSWlFf4MkRYP2ldJNw0pov1DIhKcGebAjTT6j8JIaQSiizSegyj3OFk-7jP3OhYY_GMbvQsBq4EkhP76tSOfi57uaZ43XbMwOkOtSnvKIZpd9vjF87JC4LmZ8DlscK4Ll_0PVV/s903/Screenshot%202023-08-03%20at%2015.51.49.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" height="400" data-original-height="903" data-original-width="544" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgKx_Yf5ZJf3cWMA-lm3-PwVZ3k_eCALXH4MYNCEHL3QZcs2FwSWlFf4MkRYP2ldJNw0pov1DIhKcGebAjTT6j8JIaQSiizSegyj3OFk-7jP3OhYY_GMbvQsBq4EkhP76tSOfi57uaZ43XbMwOkOtSnvKIZpd9vjF87JC4LmZ8DlscK4Ll_0PVV/s400/Screenshot%202023-08-03%20at%2015.51.49.png"/></a></div>
<p>We can zoom in so that the labels become legible, but everything is magnified equally, so we can no longer see all the tree structure to the left.</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmv-GCYtb-v5YscQQvDHfTYojNmXoW2vBR3gn1gCXj4DUufuYHy_3OUf5v8fRpSW8026Z6f311thhDEJIwfEZ2WbzyCAYSnZszs5nWqdRPp26kIXaB4qTF50Iipx7ptRoOL27vCafphYUwRVzer1-3uUDozCnXIxs_sAOQ0zo2rDRTV5mJgJxU/s909/Screenshot%202023-08-03%20at%2015.52.38.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" height="400" data-original-height="909" data-original-width="549" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmv-GCYtb-v5YscQQvDHfTYojNmXoW2vBR3gn1gCXj4DUufuYHy_3OUf5v8fRpSW8026Z6f311thhDEJIwfEZ2WbzyCAYSnZszs5nWqdRPp26kIXaB4qTF50Iipx7ptRoOL27vCafphYUwRVzer1-3uUDozCnXIxs_sAOQ0zo2rDRTV5mJgJxU/s400/Screenshot%202023-08-03%20at%2015.52.38.png"/></a></div>
<p>Indeed, if we zoom in more we rapidly lose sight of most of the tree.</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhD_8iv2A9sIxIwivjDzL7J3bwzs42m0ZD9ZtwJV7IHDdQiJpiCOU5eEg4I21SSu-OLpzlL0-NqPxfhi_3kVuUavHbVbgJKSUm4-o_qwNHFqIVeDV-j06M7FYJRTeLQdUNVpscXt4T3hS63bXoA7GLDV3c8O7ztFlpXn3WpENwPj_gmk5Ke5kcq/s903/Screenshot%202023-08-03%20at%2015.52.47.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" height="400" data-original-height="903" data-original-width="546" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhD_8iv2A9sIxIwivjDzL7J3bwzs42m0ZD9ZtwJV7IHDdQiJpiCOU5eEg4I21SSu-OLpzlL0-NqPxfhi_3kVuUavHbVbgJKSUm4-o_qwNHFqIVeDV-j06M7FYJRTeLQdUNVpscXt4T3hS63bXoA7GLDV3c8O7ztFlpXn3WpENwPj_gmk5Ke5kcq/s400/Screenshot%202023-08-03%20at%2015.52.47.png"/></a></div>
<p>This is one of the challenges presented by trees. Like space, they are mostly empty, hence simply zooming in is often not helpful.</p>
<p>So, the <strong>zooming doesn’t correspond to the structure of the tree</strong>, labels are <strong>often either not legible or absent,</strong> and <strong>levels of detail don’t change</strong> with zooming in and out.</p>
<h2 id="what-can-we-do-differently">What can we do differently?</h2>
<p>I’m going to sketch an alternative approach to viewing trees like this. I have some ropey code that I’ve used to create the diagrams below. This isn’t ready for prime time, but hopefully it illustrates the idea. The key concept is that we zoom NOT by simply expanding the viewing area in the <em>x</em> and <em>y</em> directions, but by collapsing and expanding the tree. Each zoom level corresponds to the number of nodes we will show in the tree. We use a criterion to rank the importance of each node in the tree; one approach is how “distinctive” the nodes are, see Jin et al. 2009. We then use a priority queue to choose the nodes to display at a given zoom level (see Libin et al. 2017 and Zaslavsky et al. 2007).</p>
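<p>A minimal sketch of that selection step: rank internal nodes by an importance score, then repeatedly expand the most important node reachable from the root, so a zoom level just means “how many expansions”. (The tree and score structures here are my own illustration, not taken from the papers cited.)</p>

```python
import heapq

def nodes_to_expand(children, root, importance, k):
    """Pick up to k internal nodes to expand, most important first.

    children maps a node to its list of child nodes (leaves map to nothing);
    importance maps a node to its score.
    """
    expanded = set()
    heap = [(-importance[root], root)]          # max-heap via negated scores
    while heap and len(expanded) < k:
        _, node = heapq.heappop(heap)
        expanded.add(node)
        for child in children.get(node, []):
            if children.get(child):             # only internal nodes can expand
                heapq.heappush(heap, (-importance[child], child))
    return expanded
```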
<p>Arguably this gives us a more natural way to zoom a tree: we see the main structure first, then as we zoom in more structure becomes apparent. It turns out that if the tree drawing itself is constructed using an “in-order” traversal we can greatly simplify the drawing. Imagine that the tree consists of a number of nodes (both internal and external, i.e., leaves and hypothetical ancestors), and we draw each node on a single line (as if we were using a line printer). Collapsing or expanding the tree is simply a matter of removing or adding lines. If a node is not visible we don’t draw it. If a leaf node is visible we show it as if the whole tree were visible. Internal nodes are slightly different: if a node is visible but collapsed we can draw it with a triangle representing its descendants; if it is not collapsed then we draw it as if the whole tree were visible. The end result is that we don’t need to recompute the tree as we zoom in or out, we simply compute which nodes to show, and in what state.</p>
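<p>Here is a toy version of that drawing scheme (again, my own sketch, not the code behind the figures below). Each visible node gets exactly one line: a leaf prints its name, a collapsed internal node prints a placeholder standing in for the triangle, and an expanded internal node is drawn between its children by the in-order traversal.</p>

```python
def draw(children, node, expanded, depth=0, lines=None):
    """Render a tree one node per line; internal nodes in `expanded` are open.

    "> " marks a collapsed internal node (the triangle), "+ " an expanded one.
    """
    if lines is None:
        lines = []
    kids = children.get(node, [])
    indent = "  " * depth
    if not kids:                         # leaf: always drawn in full
        lines.append(indent + node)
    elif node not in expanded:           # collapsed: triangle placeholder
        lines.append(indent + "> " + node)
    else:                                # expanded: in-order traversal
        mid = len(kids) // 2
        for kid in kids[:mid]:
            draw(children, kid, expanded, depth + 1, lines)
        lines.append(indent + "+ " + node)
        for kid in kids[mid:]:
            draw(children, kid, expanded, depth + 1, lines)
    return lines
```

<p>Expanding a node replaces its one “triangle” line with the lines for its children; nothing else on screen needs to be recomputed.</p>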
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRQrmEqAugTRQI81eezwa1miPohTMcqCuIWgIkoB5hJIkI6biSm9H20-XJojcX0vRZNPaLHYMdmeizonUc68LRFaLbMGe9GYTWqHA-IcNU-0fyjmJAJHI4tdY8GuZL4WHng8c5GXVjRvdDnD_6QjHYSqcPEg4kg4LqbzJTAalQisXPOUjydLLt/s4178/sciadv.ade4954-f1.jpg" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" height="400" data-original-height="4178" data-original-width="3600" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRQrmEqAugTRQI81eezwa1miPohTMcqCuIWgIkoB5hJIkI6biSm9H20-XJojcX0vRZNPaLHYMdmeizonUc68LRFaLbMGe9GYTWqHA-IcNU-0fyjmJAJHI4tdY8GuZL4WHng8c5GXVjRvdDnD_6QjHYSqcPEg4kg4LqbzJTAalQisXPOUjydLLt/s400/sciadv.ade4954-f1.jpg"/></a></div>
<p>As an experiment I decided to explore the legume tree used in the GBIF website. As is sadly so typical, the original publication of the tree (Ringelberg et al. 2023) doesn’t provide the actual tree, but I found a JSON version on GitHub <a href="https://github.com/gbif/hp-legume/tree/master/assets/phylotree">https://github.com/gbif/hp-legume/tree/master/assets/phylotree</a>. I then converted that to Newick format so my tools could use it (had a few bumpy moments when I discovered that the tree has negative branch lengths!). The converted file is here: <a href="https://gist.github.com/rdmpage/ef43ea75a738e75ec303602f76bf0b2e">https://gist.github.com/rdmpage/ef43ea75a738e75ec303602f76bf0b2e</a></p>
<p>I then ran the tree through my code and generated views at various zoom levels.</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1FUT1D89SQTExA0VVPOQn-lt0JotATzm46D3icHDxM1O0V4IsFxRl83xFXRZ1_HJiwkH6r29afPqoBgYGKuMOvzDoYvrb6xFPJqFhr1nJfZEaGbyMefAl3Q2ZZt5yXYmlAhesrxJMMd0JIr098rRo8qOdGQ_BHvqx1FUaRUlRfARTYtzO6slz/s614/Screenshot%202023-08-03%20at%2016.11.20.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" height="400" data-original-height="614" data-original-width="564" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1FUT1D89SQTExA0VVPOQn-lt0JotATzm46D3icHDxM1O0V4IsFxRl83xFXRZ1_HJiwkH6r29afPqoBgYGKuMOvzDoYvrb6xFPJqFhr1nJfZEaGbyMefAl3Q2ZZt5yXYmlAhesrxJMMd0JIr098rRo8qOdGQ_BHvqx1FUaRUlRfARTYtzO6slz/s400/Screenshot%202023-08-03%20at%2016.11.20.png"/></a></div>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYfrB-XdQJ68he0iL4Hm2ciXo_YLqSGOuovznXIgxsG3Q-Goc28Wf8QUgB-uivwBWouBAhvWW02KfJ8jq_oOYE8MhYx7L9sfvNAwjBRvV9bsCtyP2VxXvPZBOOntgjq6jTHN8ihhiyH0XtvI0RbXeuGKPnIS_VcoUUQVnDK28tqroh2jhe0Xh2/s1241/Screenshot%202023-08-03%20at%2016.14.56.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" height="600" data-original-height="1241" data-original-width="590" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYfrB-XdQJ68he0iL4Hm2ciXo_YLqSGOuovznXIgxsG3Q-Goc28Wf8QUgB-uivwBWouBAhvWW02KfJ8jq_oOYE8MhYx7L9sfvNAwjBRvV9bsCtyP2VxXvPZBOOntgjq6jTHN8ihhiyH0XtvI0RbXeuGKPnIS_VcoUUQVnDK28tqroh2jhe0Xh2/s600/Screenshot%202023-08-03%20at%2016.14.56.png"/></a></div>
<p>Note that as the tree expands labels are always legible, and zooming only increased the size of the tree in the <em>y</em>-axis (as the expanded nodes take up more space). Note also that we see a number of isolated taxa appearing, such as <em>Lachesiodendron viridiflorum</em>. These taxa are often of evolutionary interest, and also of high conservation interest due to their phylogenetic isolation. Simply showing the whole tree hides these taxa.</p>
<p>Now, looking at these two diagrams there are two obvious limitations. The first is that the black triangles representing collapsed clades are all the same size, regardless of whether they represent a few or many taxa. This could be addressed by adding numbers beside each triangle, using colour to reflect the number of collapsed nodes, or perhaps by breaking the “one node per row” rule by drawing particularly large nodes over two or more lines.</p>
<p>The other issue is that most of the triangles lack labels. This is because the tree itself lacks them (I added “Ingoid clade”, for example). There will be lots of nodes which can be labelled (e.g., by genus name), but once we start displaying phylogeny we will need to make use of informal names, or construct labels based on the descendants (e.g., “genus 1 - genus 5”). We can also think of having sets of labels that we locate on the tree by finding the least common ancestor (AKA the most recent common ancestor) of that label (hello Phylocode).</p>
<p>Another consideration is what to do with labels as taxa are expanded. One approach would be to use shaded regions; for example, in the last tree above we could shade the clades rooted at <em>Mimosa</em>, <em>Vachellia</em>, and the “Ingoid clade” (and others if they had labels). If we were clever we could alter which clades are shaded based on the zoom level. If we wanted these regions not to overlap (for example, if we wanted bands of colour corresponding to clades to appear on the right of the tree) then we could use something like <a href="https://en.wikipedia.org/wiki/Maximum_disjoint_set">maximum disjoint sets</a> to choose the best combination of labels.</p>
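<p>In one dimension this is tractable: each clade label occupies a run of rows down the side of the tree, and the largest set of non-overlapping labels can be found exactly with the classic greedy interval-scheduling algorithm (take whichever label ends first, then repeat). A sketch:</p>

```python
def max_disjoint_labels(labels):
    """labels: (name, first_row, last_row) tuples, rows inclusive.

    Returns the largest set of labels whose row spans don't overlap,
    using the greedy earliest-endpoint rule (exact for 1-D intervals).
    """
    chosen = []
    last_end = float("-inf")
    for name, start, end in sorted(labels, key=lambda label: label[2]):
        if start > last_end:
            chosen.append(name)
            last_end = end
    return chosen
```

<p>Note the greedy rule is only exact in one dimension; for genuinely two-dimensional label boxes the maximum disjoint set problem is much harder.</p>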
<h2 id="summary">Summary</h2>
<p>I don’t claim that this alternative visualisation is perfect (and my implementation of it is very far from perfect), but I think it shows that there are ways to zoom into trees that reflect tree structure, ensure labels are always legible, and support levels of detail (collapsed nodes expanding as we zoom). The use of an in-order traversal and three styles of node drawing means that the diagram is simple to render. We don’t need fancy graphics; we can simply have a list of images.</p>
<p>To conclude, I think it’s great that GBIF is moving to include phylogenies. But we can't visualise a phylogeny as a static image; it's a structure that requires us to think about how to display it with the same level of creativity that makes web maps such a successful visualisation.</p>
<h2 id="reading">Reading</h2>
<p>Jin Chen, MacEachren, A. M., & Peuquet, D. J. (2009). Constructing Overview + Detail Dendrogram-Matrix Views. IEEE Transactions on Visualization and Computer Graphics, 15(6), 889–896. <a href="https://doi.org/10.1109/tvcg.2009.130">https://doi.org/10.1109/tvcg.2009.130</a></p>
<p>Libin, P., Vanden Eynden, E., Incardona, F., Nowé, A., Bezenchek, A., … Sönnerborg, A. (2017). PhyloGeoTool: interactively exploring large phylogenies in an epidemiological context. Bioinformatics, 33(24), 3993–3995. <a href="https://doi.org/10.1093/bioinformatics/btx535">https://doi.org/10.1093/bioinformatics/btx535</a></p>
<p>Page, R. D. M. (2012). Space, time, form: Viewing the Tree of Life. <em>Trends in Ecology & Evolution</em>, <em>27</em>(2), 113–120. <a href="https://doi.org/10.1016/j.tree.2011.12.002">https://doi.org/10.1016/j.tree.2011.12.002</a></p>
<p>Ribeiro, P. G., Luckow, M., Lewis, G. P., Simon, M. F., Cardoso, D., De Souza, É. R., Conceição Silva, A. P., Jesus, M. C., Dos Santos, F. A. R., Azevedo, V., & De Queiroz, L. P. (2018). <em>Lachesiodendron</em>, a new monospecific genus segregated from <em>Piptadenia</em> (Leguminosae: Caesalpinioideae: mimosoid clade): evidence from morphology and molecules. <em>TAXON</em>, <em>67</em>(1), 37–54. <a href="https://doi.org/10.12705/671.3">https://doi.org/10.12705/671.3</a></p>
<p>Ringelberg, J. J., Koenen, E. J. M., Sauter, B., Aebli, A., Rando, J. G., Iganci, J. R., De Queiroz, L. P., Murphy, D. J., Gaudeul, M., Bruneau, A., Luckow, M., Lewis, G. P., Miller, J. T., Simon, M. F., Jordão, L. S. B., Morales, M., Bailey, C. D., Nageswara-Rao, M., Nicholls, J. A., … Hughes, C. E. (2023). Precipitation is the main axis of tropical plant phylogenetic turnover across space and time. <em>Science Advances</em>, <em>9</em>(7), eade4954. <a href="https://doi.org/10.1126/sciadv.ade4954">https://doi.org/10.1126/sciadv.ade4954</a></p>
<p>Zaslavsky L., Bao Y., Tatusova T.A. (2007) An Adaptive Resolution Tree Visualization of Large Influenza Virus Sequence Datasets. In: Măndoiu I., Zelikovsky A. (eds) Bioinformatics Research and Applications. ISBRA 2007. Lecture Notes in Computer Science, vol 4463. Springer, Berlin, Heidelberg. <a href="https://doi.org/10.1007/978-3-540-72031-7_18">https://doi.org/10.1007/978-3-540-72031-7_18</a></p>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>
<p>Roderic Page</p>
<h1>Sub-second searching of millions of DNA barcodes using a vector database</h1>
<div style="padding:1em;border:1px solid #DDD;border-radius:0.2em;">How to cite: <strong>Page, R. (2023). Sub-second searching of millions of DNA barcodes using a vector database. <a href="https://doi.org/10.59350/qkn8x-mgz20">https://doi.org/10.59350/qkn8x-mgz20</a></strong></div>
<p>Recently I’ve been messing about with DNA barcodes. I’m a junior author, with David Schindel, on the forthcoming book chapter <a href="https://doi.org/10.5281/zenodo.8092339">Creating Virtuous Cycles for DNA Barcoding: A Case Study in Science Innovation, Entrepreneurship, and Diplomacy</a>, and I’ve blogged about <a href="https://iphylo.blogspot.com/2023/07/adventures-in-machine-learning.html">Adventures in machine learning: iNaturalist, DNA barcodes, and Lepidoptera</a>. One thing I’ve always wanted is a simple way to explore DNA barcodes both geographically and phylogenetically. I’ve made various toys (e.g., <a href="https://iphylo.blogspot.com/2016/05/notes-on-next-steps-for-million-dna.html">Notes on next steps for the million DNA barcodes map</a> and <a href="https://iphylo.blogspot.com/2020/07/dna-barcode-browser.html">DNA barcode browser</a>), but one big challenge has been search.</p>
<p>The goal is to be able to take a DNA sequence, search the DNA barcode database for barcodes that are similar to that sequence, then build a phylogenetic tree for the results. And I want this to be fast. The approach I used in my “DNA barcode browser” was to use Elasticsearch and index the DNA sequences as n-grams (=k-mers). This worked well for small numbers of sequences, but when I tried it on millions of sequences things got very slow: a search typically took around eight seconds to complete. This is about the same as BLAST on my laptop for the same dataset. Search times like these are simply too slow, hence I put this work on the back burner. That is, until I started exploring <a href="https://www.forbes.com/sites/adrianbridgwater/2023/05/19/the-rise-of-vector-databases/">vector databases</a>.</p>
<p>Vector databases, as the name suggests, store vectors, that is, arrays of numbers. Many of the AI sites currently gaining attention use vector databases. For example, chatbots based on ChatGPT typically take text, convert it to an “embedding” (a vector), then search a database for similar vectors which, hopefully, correspond to documents that are related to the original query (see <a href="https://iphylo.blogspot.com/2023/04/chatgpt-semantic-search-and-knowledge.html">ChatGPT, semantic search, and knowledge graphs</a>).</p>
<p>The key step is to convert the thing you are interested in (e.g., text, or an image) into an embedding, which is a vector of fixed length that encodes information about the thing. In the case of DNA sequences one way to do this is to use k-mers. These are short, overlapping fragments of the DNA sequence (see <a href="https://iphylo.blogspot.com/2017/03/this-is-what-phylodiversity-looks-like.html">This is what phylodiversity looks like</a>). In the case of k-mers of length 5 the embedding is a vector of the frequencies of the 4<sup>5</sup> = 1,024 different k-mers for the letters A, C, G, and T.</p>
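To make the embedding concrete, here is a minimal sketch in Python (illustrative only, not the code used for this work): given a DNA sequence it returns the 4<sup>k</sup>-element vector of k-mer frequencies.

```python
from itertools import product

def kmer_embedding(sequence, k=5):
    """Return a vector of k-mer frequencies for a DNA sequence.

    The vector has 4**k entries, one per possible k-mer over A, C, G, T,
    in lexicographic order. K-mers containing ambiguity codes (N, etc.)
    are simply skipped.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    sequence = sequence.upper()
    total = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in index:
            counts[index[kmer]] += 1
            total += 1
    # Normalise counts to frequencies so that sequences of different
    # lengths are directly comparable.
    if total:
        return [c / total for c in counts]
    return counts
```

For k=5 this gives the 1,024-dimensional vector mentioned above; normalising to frequencies (rather than raw counts) is one reasonable choice, not the only one.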
<p>But what do we do with these vectors? This is where the vector database comes in. Search in a vector database is essentially a nearest-neighbour search - we want to find vectors that are similar to our query vector. There has been a lot of cool research on this problem (which is now highly topical because of the burgeoning interest in machine learning), and not only are there vector databases, but tools to add this functionality to existing databases.</p>
<p>So, I decided to experiment. I grabbed a copy of <a href="https://www.postgresql.org">PostgreSQL</a> (not a database I’d used before), added the <a href="https://github.com/pgvector/pgvector">pgvector extension</a>, then created a database with over 9 million DNA barcodes. After a bit of faffing around, I got it to work (code still needs cleaning up, but I will release something soon).</p>
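As a sketch of how this fits together with pgvector (the table and column names below are hypothetical, not my actual schema), the embedding goes in a <code>vector(1024)</code> column and nearest neighbours are found with pgvector’s distance operator:

```python
# Hypothetical pgvector setup: a table of barcodes, each with a
# 1,024-dimensional k-mer frequency embedding.
DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE barcodes (
    accession  text PRIMARY KEY,
    embedding  vector(1024)
);
"""

def nearest_neighbour_sql(limit=100):
    # "<->" is pgvector's Euclidean (L2) distance operator;
    # "<=>" would give cosine distance instead.
    return (
        "SELECT accession, embedding <-> %s AS distance "
        "FROM barcodes ORDER BY embedding <-> %s LIMIT {}".format(limit)
    )

# Typical use with psycopg2 (requires a running PostgreSQL server):
#   cur.execute(nearest_neighbour_sql(100), (query_vector, query_vector))
```

An IVFFlat or HNSW index on the embedding column is what makes such queries fast on millions of rows.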
<p>So far the results are surprisingly good. If I enter a nucleotide sequence, such as <a href="https://www.ncbi.nlm.nih.gov/nuccore/JF491468">JF491468</a> (Neacomys sp. BOLD:AAA7034 voucher ROM 118791) and search for the 100 most similar sequences I get back 100 <em>Neacomys</em> sequences in 0.14 seconds(!). I can then take the vectors for each of those sequences (i.e., the array of k-mer frequencies), compute a pairwise distance matrix, then build a phylogeny (in <a href="https://paup.phylosolutions.com">PAUP*</a>, naturally).</p>
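The distance-matrix step can be sketched like this (illustrative only): pairwise Euclidean distances between the frequency vectors, giving a matrix a distance-based tree method such as neighbour-joining can use.

```python
import math

def distance_matrix(vectors):
    """Pairwise Euclidean distances between k-mer frequency vectors.

    Returns a symmetric matrix (list of lists) suitable as input to a
    distance-based tree-building method such as neighbour-joining.
    """
    n = len(vectors)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(vectors[i], vectors[j])))
            d[i][j] = d[j][i] = dist
    return d
```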
<p>Searches this rapid mean we can start to interactively explore large databases of DNA barcodes, as well as quickly take new, unknown sequences and ask “have we seen this before?”</p>
<p>As a general tool this approach has limitations. Vector databases have a limit on the size of vector they can handle, so k-mers much larger than 5 will not be feasible (unless the vectors are sparse, in the sense that not all k-mers actually occur). Also, it’s not clear to me how much of this approach’s success is due to the nature of barcode data. Typically barcodes are either very similar to each other (i.e., from the same species), or they are quite different (the famous “barcode gap”). This may have implications for the success of nearest-neighbour searching.</p>
<p>Still early days, but so far this has been a revelation, and opens up some interesting possibilities for how we could explore and interact with DNA barcodes.</p>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>
<h1>What, if anything, is the Biodiversity Knowledge Hub?</h1>
<div style="padding:1em;border:1px solid #DDD;border-radius:0.2em;">How to cite: <strong>Page, R. (2023). What, if anything, is the Biodiversity Knowledge Hub? <a href="https://doi.org/10.59350/axeqb-q2w27">https://doi.org/10.59350/axeqb-q2w27</a></strong></div>
<p>To much fanfare BiCIKL launched the “Biodiversity Knowledge Hub” (see <a href="https://bicikl-project.eu/news/BKH">Biodiversity Knowledge Hub is online!!!</a>). This is advertised as a “game-changer in scientific research”. The snappy video in the launch <a href="https://twitter.com/bicikl_h2020/status/1681222449851924481">tweet</a> claims that the hub will</p>
<ul>
<li>help your research thanks to interlinked data…</li>
<li>…and respond to complex queries with the services provided…</li>
</ul>
<div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='400' height='322' src='https://www.blogger.com/video.g?token=AD6v5dw97DbN2ubfhSE0wS3Tw0-JYI5EAijy7NwQeo6aNePOUHI1AlQMYg8jckTxwQBHgyxT0s7sVxHVLYw' class='b-hbp-video b-uploaded' frameborder='0'></iframe></div>
<p>Interlinked data, complex queries, this all sounds very impressive. The video invites us to “Visit the Biodiversity Knowledge Hub and give it a shot”. So I did.</p>
<p>The first thing that strikes me is the following:</p>
<blockquote>
<p>Disclaimer: The partner Organisations and Research Infrastructures are fully responsible for the provision and maintenance of services they present through BKH. All enquiries about a particular service should be sent directly to its provider.</p>
</blockquote>
<p>If the organisation trumpeting a new tool takes no responsibility for that tool, then that is a red flag. To me it implies that they are not taking this seriously, they have no skin in the game. If this work mattered you’d have a vested interest in seeing that it actually worked.</p>
<p>I then tried to make sense of what the Hub is and what it offers.</p>
<p>Is it maybe an <strong>aggregation</strong>? Imagine diverse biodiversity datasets linked together in a single place so that we could seamlessly query across that data, bouncing from taxa to sequences to ecology and more. GBIF and GenBank are examples of aggregations where data is brought together, cleaned, reconciled, and services built on top of that. You can go to GBIF and get distribution data for a species; you can go to GenBank and compare your sequence with millions of others. Is the Hub an aggregation? … no, it is not.</p>
<p>Is it a <strong>federation</strong>? Maybe instead of merging data from multiple sources, the data lives on the original sites, but we can query across it, a bit like a travel search engine queries across multiple airlines to find us the best flight. The data still needs to be reconciled, or at least share identifiers and vocabularies. Is the Hub a federation? … no, it is not.</p>
<p>OK, so maybe we still have data in separate silos, but maybe the Hub is a <strong>data catalogue</strong> where we can search for data using text terms (a bit like <a href="https://datasetsearch.research.google.com">Google’s Dataset Search</a>)? Or even better, maybe it describes the data in machine-readable terms so that we could find out what data are relevant to our interests (e.g., what datasets deal with taxa and ecological associations based on sequence data?). Is it a data catalogue? … no, it is not.</p>
<p>OK, then what actually is it?</p>
<p>It is a <strong>list</strong>. They built a list. If you go to <a href="https://biodiversityknowledgehub.eu/fair-data-place/">FAIR DATA PLACE</a> you see an invitation to <a href="https://biodiversityknowledgehub.eu/fair-data-place/explore-linked-data/">EXPLORE LINKED DATA</a>. Sounds inviting (“linked data, oohhh”) but it’s a list of a few projects: ChecklistBank, e-Biodiv, LifeBlock, OpenBiodiv, PlutoF, Biodiversity PMC, Biotic Interactions Browser, SIBiLS SPARQL Endpoint, Synospecies, and TreatmentBank.</p>
<p>These are not in any way connected, they all have distinct APIs, different query endpoints, speak different languages (e.g., REST, SPARQL), and there’s no indication that they share identifiers even if they overlap in content. How can I query across these? How can I determine whether any of these are relevant to my interests? What is the point in providing SPARQL endpoints (e.g., OpenBiodiv, SIBiLS, Synospecies) without giving the user any clue as to what they contain, what vocabularies they use, what identifiers, etc.?</p>
<p>The overall impression is of a bunch of tools with varying levels of sophistication stuck together on a web page. This is in no way a “game-changer”, nor is it “interlinked data”, nor is there any indication of how it supports “complex queries”.</p>
<p>It feels very much like the sort of thing one cobbles together as a demo when applying for funding. “Look at all these disconnected resources we have, give us money and we can join them together”. Instead it is being promoted as an actual product.</p>
<p>Instead of the hyperbole, why not tackle the real challenges here? At a minimum we need to know how each service describes its data; those services should use the same vocabularies and identifiers for the same things, be able to tell us what entities and relationships they cover, and we should be able to query across them. This all involves hard work, obviously, so let’s stop pretending that it doesn’t and do that work, rather than claim that a list of web sites is a “game-changer”.</p>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>
<h1>Adventures in machine learning: iNaturalist, DNA barcodes, and Lepidoptera</h1>
<div style="padding:1em;border:1px solid #DDD;border-radius:0.2em;">How to cite: <strong>Page, R. (2023). Adventures in machine learning: iNaturalist, DNA barcodes, and Lepidoptera. <a href="https://doi.org/10.59350/5q854-j4s23">https://doi.org/10.59350/5q854-j4s23</a></strong></div>
<p>Recently I’ve been working with a masters student, Maja Nagler, on a project using machine learning to identify images of Lepidoptera. This has been something of an adventure as I am new to machine learning, and have only minimal experience with the Python programming language. So what could possibly go wrong?</p>
<p>The inspiration for this project comes from (a) using iNaturalist’s machine learning to help identify pictures I take using their app, and (b) exploring DNA barcoding data, which has a wealth of images of specimens linked to DNA sequences (see <a href="https://www.gbif.org/occurrence/gallery?dataset_key=040c5662-da76-4782-a48e-cdea1892d14c&taxon_key=797">gallery in GBIF</a>) that are presumably reliably identified (by the barcodes). So, could we use the DNA barcode images to build models to identify specimens? Is it possible to use models already trained on citizen science data, or do we need custom models trained on specimens? Can models trained on museum specimens be used to identify living specimens?</p>
<p>To answer this we’ve started simple, using the <a href="https://github.com/visipedia/inat_comp">iNaturalist 2018</a> competition as a starting point. There is code on GitHub for an entry in that challenge, and the challenge data is available, so the idea was to take that code and model and see how well it works on DNA barcode images.</p>
<p>That was the plan. I ran into a slew of Python-related issues involving out of date code, dependencies, and issues with running on a MacBook. Python is, well, a mess. I know there are ways to “tame” the mess, but I’m amazed that anyone can get anything done in machine learning given how temperamental the tools are.</p>
<p>Another consideration is that machine learning is computationally intensive, and typically uses PCs with NVIDIA chips. Macs don’t have these chips. However, Apple’s newer Macs provide <a href="https://developer.apple.com/metal/pytorch/">Metal Performance Shaders (MPS)</a>, which does speed things up. But getting everything to work together was a nightmare. This is a field full of obscure incantations, bugs, and fixes. I describe some of the things I went through in the <a href="https://github.com/rdmpage/inat_comp_2018">README for the repository</a>. Note that this code is really a safety net. Maja is working on a more recent model (using Google’s <a href="https://colab.research.google.com/?utm_source=scs-index">Colab</a>); I just wanted to make sure that we had a backup in place in case my notion that this ML stuff would be “easy” turned out to be, um, wrong.</p>
<p>Long story short, everything now works. Because our focus is Lepidoptera (moths and butterflies) I ended up subsetting the original challenge dataset to include just those taxa. This resulted in 1234 species. This is obviously a small number, but it means we can train a reasonable model in less than a week (ML is really, really, computationally expensive).</p>
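The subsetting step can be sketched as follows, assuming COCO-style annotation JSON in which each category record carries an <code>"order"</code> field (as in the iNaturalist competition metadata; the function name is mine, not from the challenge code):

```python
def subset_to_order(data, order="Lepidoptera"):
    """Keep only categories, annotations, and images for one taxonomic order.

    `data` is a parsed COCO-style annotation dict with "categories",
    "annotations", and "images" lists; assumes each category has an
    "order" field (field names may differ between dataset releases).
    """
    keep = {c["id"] for c in data["categories"] if c.get("order") == order}
    data["categories"] = [c for c in data["categories"] if c["id"] in keep]
    data["annotations"] = [a for a in data["annotations"]
                           if a["category_id"] in keep]
    # Drop images that no longer have any annotations.
    image_ids = {a["image_id"] for a in data["annotations"]}
    data["images"] = [i for i in data["images"] if i["id"] in image_ids]
    return data
```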
<p>There is still lots to do, but I want to share a small result. After training the model on Lepidoptera from the iNaturalist 2018 dataset, I ran a small number of images from the DNA barcode dataset through it. The results are encouraging. For example, for <em>Junonia villida</em> all the barcoded specimens were either correctly identified (green) or were in the top three hits (orange) (the code outputs the top three hits for each image). So a model trained on citizen science images of (mostly) living specimens can identify museum specimens.</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiry3LTErcKJ41O3R5iLAV2vEbV6Vm49yMx7Wt69xyoDl8B5O62NmDRjeCpkLx5j2EML6JRpppG6B9zN0m6xfSWHivHPb6GCFF3EnOdlrAD9d8Fj03TVNrwnQJVVnVaGR8zMcJiSl_IAv1-9ITxbogg240-sxFTV9hYFyVyzuSnAZLwVA3Z8HWN/s1852/Screenshot%202023-07-18%20at%2016.16.37.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="400" data-original-height="786" data-original-width="1852" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiry3LTErcKJ41O3R5iLAV2vEbV6Vm49yMx7Wt69xyoDl8B5O62NmDRjeCpkLx5j2EML6JRpppG6B9zN0m6xfSWHivHPb6GCFF3EnOdlrAD9d8Fj03TVNrwnQJVVnVaGR8zMcJiSl_IAv1-9ITxbogg240-sxFTV9hYFyVyzuSnAZLwVA3Z8HWN/s400/Screenshot%202023-07-18%20at%2016.16.37.png"/></a></div>
<p>For other species the results are not so great, but are still interesting. For example, for <em>Junonia orithya</em> quite a few images are not correctly identified (red). Looking at the images, it looks like specimens photographed ventrally are going to be a problem (unlikely to be a common angle for photographs of living specimens), and specimens with scale grids and QR codes are unlikely to be seen in the wild(!).</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEik3peDy6dm74pbqLczY3CHnVVOQSNfWniZNdOfaHWgVFMV1vomKcaGWmg4jqJ1nE8qZtVOwgs6_xGkiWFXbaqS47K48P8MAdjy0btwxfvZ86JWZstmbM8lNzS390btL8wU65N-esJkMXG-Uq9ohf6k33Vi_aTDZkHAY22pYkkER_6UN-iPJz73/s1854/Screenshot%202023-07-18%20at%2016.21.02.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="400" data-original-height="1454" data-original-width="1854" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEik3peDy6dm74pbqLczY3CHnVVOQSNfWniZNdOfaHWgVFMV1vomKcaGWmg4jqJ1nE8qZtVOwgs6_xGkiWFXbaqS47K48P8MAdjy0btwxfvZ86JWZstmbM8lNzS390btL8wU65N-esJkMXG-Uq9ohf6k33Vi_aTDZkHAY22pYkkER_6UN-iPJz73/s400/Screenshot%202023-07-18%20at%2016.21.02.png"/></a></div>
<p>An obvious thing to do would be to train a model based on DNA barcode specimens and see how well it identifies citizen science images (and Maja will be doing just that). If that works well, then that would suggest that there is scope for expanding models for identifying live insects to include museum specimen images (and vice versa), see also <a href="https://iphylo.blogspot.com/2017/12/towards-digital-natural-history-museum.html">Towards a digital natural history museum</a>.</p>
<p>It is early days, still lots of work to do, and deadlines are pressing, but I’m looking forward to seeing how Maja’s project evolves. Perhaps the pain of Python, PyTorch, MPS, etc. will all be worth it.</p>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>
<h1>A taxonomic search engine</h1>
<div style="padding:1em;border:1px solid #DDD;border-radius:0.2em;">How to cite: <strong>Page, R. (2023). A taxonomic search engine. <a href="https://doi.org/10.59350/r3g44-d5s15">https://doi.org/10.59350/r3g44-d5s15</a></strong></div>
<p>Tony Rees commented on my recent post <a href="https://iphylo.blogspot.com/2023/05/ten-years-and-million-links.html">Ten years and a million links</a>. I’ve responded to some of his comments, but I think the bigger question deserves more space, hence this blog post.</p>
<h2 id="tony’s-comment">Tony’s comment</h2>
<blockquote>
<p>Hi Rod, I like what you’re doing. Still struggling (a little) to find the exact point where it answers the questions that are my “entry points” so to speak, which (paraphrasing a post of yours from some years back) start with:</p>
</blockquote>
<blockquote>
<ul>
<li>Is this a name that “we” (the human race I suppose) recognise as having been used for a taxon (think Global Names Resolver, Taxamatch, etc.) - preferably an automatable query and response (i.e., a machine can ask it and incorporate the result into a workflow)</li>
<li>Does it refer to a currently accepted taxon or if not, what is the accepted equivalent</li>
<li>What is its taxonomic placement (according to one or a range of expert systems)</li>
<li>Also, for reporting/comparison/analysis purposes…</li>
<li>How many accepted taxa (at whatever rank) are currently known in group X (or the whole world)</li>
<li>How many new names (accepted or unaccepted) were published in year A (or date range A-C)</li>
<li>How many new names were published (or co-authored) by author Z</li>
<li>(and probably more)</li>
</ul>
</blockquote>
<blockquote>
<p>Having access to more of the primary literature is great, and necessary, but does not help me in those respects (since the published works must still be parsed by a human, not a machine). But maybe it does answer some other questions like how many original works were published by author Z, in a particular time frame.</p>
</blockquote>
<blockquote>
<p>Of course as you will be aware, using ORCIDs for authors is only a small portion of the puzzle, since ORCIDs are not issued for deceased authors, or those who never request one, so far as I am aware.</p>
</blockquote>
<blockquote>
<p>None of the above is a criticism of what you are doing! Just trying to see if I can establish any new linkages to what you are doing which will enable me to automate portions of my own efforts to a greater degree (let machines do things that currently still require a human). So far (as evidenced by the most recent ION data dump you were able to supply) it is giving me a DOI in many cases as a supplement to the title of the original work (per ION/Zoological Record) which is something of a time saver in my quest to read the original work (from which I could extract the DOI as well once reached) but does not really automate anything since I still have to try and find it in order to peruse the content.</p>
</blockquote>
<blockquote>
<p>Mostly random thoughts above, possibly of no use, but I do ruminate on the universe of connected “things not strings” in the hope that one day life will get easier for biodiversity informatics workers, or maybe that the “book of life” will be self-assembling…</p>
</blockquote>
<h2 id="my-response">My response</h2>
<p>I think there are several ways to approach this. I’ll walk through them below, but TL;DR</p>
<ul>
<li>Define the questions we have and how we would get the answers. For example, what combination of database and SQL queries, or web site and API calls, or knowledge graph and SPARQL queries do we need to answer each question?</li>
<li>Decide what sort of interface(s) we want. Do we want a web site with a search box, a bunch of API calls, or a natural language interface?</li>
<li>If we want natural language, how do we do that? Do we want a ChatBot?</li>
<li>And as an aside, how can we speed up reading the taxonomic literature?</li>
</ul>
<p>The following are more notes than a reasoned essay. I wanted to record a bunch of things to help me think about these topics.</p>
<h3 id="build-a-natural-language-search-engine">Build a natural language search engine</h3>
<p>One of the first things I read that opened my eyes to the potential of OpenAI-powered tools and how to build them was <a href="https://paul-graham-gpt.vercel.app">Paul Graham GPT</a> which I talked about <a href="https://iphylo.blogspot.com/2023/04/chatgpt-semantic-search-and-knowledge.html">here</a>. This is a simple question and answer tool that takes a question and returns an answer, based on Paul Graham’s blog posts. We could do something similar for taxonomic names (or indeed, anything where we have some text and want to query it). At its core we have a bunch of blocks of text and embeddings for those blocks; we get an embedding for the question and find the blocks whose embeddings best match it.</p>
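The mechanics can be sketched as follows; the <code>embed</code> function here is a toy bag-of-words stand-in for a real embedding model (e.g. an API call), just enough to show the retrieval step:

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a real embedding model; a bag-of-words vector
    # is enough to demonstrate the nearest-neighbour mechanics.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_blocks(question, blocks, top_n=3):
    """Rank text blocks by similarity to the question, best first."""
    q = embed(question)
    scored = sorted(blocks, key=lambda b: cosine(q, embed(b)), reverse=True)
    return scored[:top_n]
```

In a real system the blocks and their embeddings would live in a vector database, and the ranking would be an approximate nearest-neighbour query rather than a full sort.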
<h4 id="generating-queries">Generating queries</h4>
<p>One approach is to use ChatGPT to formulate a database query based on a natural language question. There have been a bunch of people exploring generating SPARQL queries from natural language, e.g. <a href="https://apex974.com/articles/chatgpt-for-info-retrieval-from-knowledge-graph">ChatGPT for Information Retrieval from Knowledge Graph</a>, <a href="https://medium.com/openlink-software-blog/using-chatgpt-to-generate-a-course-description-knowledge-graph-using-rdf-daa107560776">ChatGPT Exercises — Generating a Course Description Knowledge Graph using RDF</a>, and <a href="https://finnaarupnielsen.wordpress.com/2023/03/31/wikidata-and-chatgpt/">Wikidata and ChatGPT</a>, and this could be explored for other query languages.</p>
<p>So in this approach we take natural language questions and get back the queries we need to answer those questions. We then go away and run those queries.</p>
<h4 id="generating-answers">Generating answers</h4>
<p>This still leaves us with what to do with the answers. Given, say, a SPARQL response, we could have code that generates a bunch of simple sentences from that response, e.g. “name x is a synonym of name y”, “there are ten species in genus x”, “name x was published in the paper zzz”, etc. We then pass those sentences to an AI to summarise into nicer natural language. We should aim for something like the Wikipedia-derived snippets from DBpedia (see <a href="https://iphylo.blogspot.com/2019/05/ozymandias-meets-wikipedia-with-notes.html">Ozymandias meets Wikipedia, with notes on natural language generation</a>). Indeed, we could help make more engaging answers by adding DBpedia snippets for the relevant taxa, abstracts from relevant papers, etc. to the SPARQL results and ask the AI to summarise all of that.</p>
<h4 id="skipping-queries-altogether">Skipping queries altogether</h4>
<p>Another approach is to generate all the answers ahead of time. Essentially, we take our database or knowledge graph and generate simplified sentences summarising everything we know: “species x was described by author y in 1920”, “species x was synonymised with species y in 1967”, etc. We then get embeddings for these answers, store them in a vector database, and we can query them using a chatbot-style interface.</p>
<p>There is a big literature on embedding RDF (see <a href="http://rdf2vec.org">RDF2vec.org</a>), and also converting RDF to sentences. These “RDF verbalisers” are further discussed on the <a href="https://synalp.gitlabpages.inria.fr/webnlg-challenge/">WebNLG Challenge pages</a>, and an example is here: <a href="https://github.com/rali-udem/jsRealB">jsRealB - A JavaScript Bilingual Text Realizer for Web Development</a>.</p>
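A toy verbaliser along these lines might look like this (the templates and predicate names are purely illustrative, not from any real vocabulary):

```python
# Templates mapping predicates to simple sentence patterns.
TEMPLATES = {
    "synonym_of": "{s} is a synonym of {o}",
    "published_in": "{s} was published in {o}",
    "species_count": "there are {o} species in genus {s}",
}

def verbalise(triples):
    """Convert (subject, predicate, object) triples into plain sentences.

    Triples whose predicate has no template are silently skipped.
    """
    sentences = []
    for s, p, o in triples:
        template = TEMPLATES.get(p)
        if template:
            sentences.append(template.format(s=s, o=o))
    return sentences
```

Each generated sentence would then be embedded and stored, ready to be matched against incoming questions.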
<p>This approach is like the game <a href="https://en.wikipedia.org/wiki/Jeopardy!">Jeopardy!</a>: we generate all the answers and the goal is to match the user’s question to one or more of those answers.</p>
<h3 id="machine-readability">Machine readability</h3>
<blockquote>
<p>Having access to more of the primary literature is great, and necessary, but does not help me in those respects (since the published works must still be parsed by a human, not a machine).</p>
</blockquote>
<p>This is a good point, but help is at hand. There are a bunch of AI tools to “read” the literature for you, such as <a href="https://typeset.io">SciSpace’s Copilot</a>. I think there’s a lot we could do to explore these tools. We could also couple them with the name - publication links in the ten year library. For example, if we know that there is a link between a name and a DOI, and we have the text for the article with that DOI we could then ask targeted questions regarding what the papers says about that name. One way to implement this is to do something similar to the Paul Graham GPT demo described above. We take the text of the paper, chunk it into smaller blocks (e.g., paragraphs), get embeddings for each block, add those to a vector database, and we can then search that paper (and others) using natural language. We could imagine an API that takes a paper and splits out the core “facts” or assertions that the paper makes. This also speaks to <a href="https://iphylo.blogspot.com/2019/07/notes-on-collections-knowledge-graphs.html">Notes on collections, knowledge graphs, and Semantic Web browsers</a> where I bemoaned the lack of a semantic web browser.</p>
<h2 id="summary">Summary</h2>
<p>I think the questions being asked are all relatively straightforward to answer; we just need to think a little bit about the best way to answer them. Much of what I’ve written above is focussed on making such a system more broadly useful and engaging, with richer answers than a simple database query. But a first step is to define the questions and the queries that would answer them, then figure out what interface to wrap this up in.</p>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>
<h1>Ten years and a million links</h1>
<p>As trailed in a Twitter thread last week, I’ve been working on a manuscript describing the efforts to map taxonomic names to their original descriptions in the taxonomic literature.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Putting together a manuscript on linking taxonomic names to the primary literature, basically “um, what, exactly, have you been doing all these years?”. TL;DR Across fungi, plants, and animals approx 1.3 million names have been linked to a persistent identifier for a publication.</p>— Roderic Page (@rdmpage) <a href="https://twitter.com/rdmpage/status/1661714128413573120?ref_src=twsrc%5Etfw">May 25, 2023</a></blockquote>
<p>The preprint is on bioRxiv <a href="https://doi.org/10.1101/2023.05.29.542697">doi:10.1101/2023.05.29.542697</a></p>
<blockquote>
<p>A major gap in the biodiversity knowledge graph is a connection between taxonomic names and the taxonomic literature. While both names and publications often have persistent identifiers (PIDs), such as Life Science Identifiers (LSIDs) or Digital Object Identifiers (DOIs), LSIDs for names are rarely linked to DOIs for publications. This article describes efforts to make those connections across three large taxonomic databases: Index Fungorum, International Plant Names Index (IPNI), and the Index of Organism Names (ION). Over a million names have been matched to DOIs or other persistent identifiers for taxonomic publications. This represents approximately 36% of names for which publication data is available. The mappings between LSIDs and publication PIDs are made available through ChecklistBank. Applications of this mapping are discussed, including a web app to locate the citation of a taxonomic name, and a knowledge graph that uses data on researcher’s ORCID ids to connect taxonomic names and publications to authors of those names.</p>
</blockquote>
<p>Much of the work has been linking names to the primary literature, which still has huge gaps. There are also interesting differences in coverage between plants, animals, and fungi (see preprint for details).</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdWsSQhqi1DErXMIHm28g37-fiALNIsI5eQZmvoX_Fe03ZSwtKHbYt-LCsCCAUop0AGcwy_w7NpIjylVH1hNrM9oW-6j9e6tHASha49TTqFvDg2_tEx3r74RRFsjUo4M_Qat8NmKaZSChOt2hI3LsMjTVLrEVirEckU-9Ei7ug-7OHQlR4LA/s2276/animals-coverage.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="2276" data-original-width="2276" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdWsSQhqi1DErXMIHm28g37-fiALNIsI5eQZmvoX_Fe03ZSwtKHbYt-LCsCCAUop0AGcwy_w7NpIjylVH1hNrM9oW-6j9e6tHASha49TTqFvDg2_tEx3r74RRFsjUo4M_Qat8NmKaZSChOt2hI3LsMjTVLrEVirEckU-9Ei7ug-7OHQlR4LA/s320/animals-coverage.png"/></a></div><div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdyxlVJ-oyMCNPmHtHWjSxdxMSJvgzdWRGRF6Ad4dk7ab7gGDpuKdKmS9XhROkopw361ylfsTd1ZkwkF6BN0JlWNnVLCKY1AfryCfWKHkgPQM7u-0SELW9j8RlQIflb6ibaV64gwW7oJrEvOGECvR51F8EW8cRg-1usW-GBM5ymObj7zlObQ/s2276/fungi-coverage.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="2276" data-original-width="2276" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdyxlVJ-oyMCNPmHtHWjSxdxMSJvgzdWRGRF6Ad4dk7ab7gGDpuKdKmS9XhROkopw361ylfsTd1ZkwkF6BN0JlWNnVLCKY1AfryCfWKHkgPQM7u-0SELW9j8RlQIflb6ibaV64gwW7oJrEvOGECvR51F8EW8cRg-1usW-GBM5ymObj7zlObQ/s320/fungi-coverage.png"/></a></div><div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgf0YBuvNSXWAJTfQ1jk4XSocMzCYHP7t6IPUqhjQ3mftgM_850igWaD2copgNH6Xk6T62xBU641wvwOvXgCCDY3m2xC_gaILXO9RGx8H3Gpy5OOncsLb9smpT2LIgtYOExVBVdDRWqA0AZ8-mQjWL7dL5TiG7MqVu8spT8ACoGOPR_T36hRA/s2276/plants-coverage.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="2276" data-original-width="2276" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgf0YBuvNSXWAJTfQ1jk4XSocMzCYHP7t6IPUqhjQ3mftgM_850igWaD2copgNH6Xk6T62xBU641wvwOvXgCCDY3m2xC_gaILXO9RGx8H3Gpy5OOncsLb9smpT2LIgtYOExVBVdDRWqA0AZ8-mQjWL7dL5TiG7MqVu8spT8ACoGOPR_T36hRA/s320/plants-coverage.png"/></a></div>
There is also a simple app to demonstrate these links; see <a href="https://species-cite.herokuapp.com">https://species-cite.herokuapp.com</a>.
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>
Roderic Pagehttp://www.blogger.com/profile/00269598293846172649noreply@blogger.comtag:blogger.com,1999:blog-16081779.post-1429128217542226342023-04-25T14:01:00.004+01:002023-04-27T15:51:08.020+01:00Library interfaces, knowledge graphs, and Miller columns<p>Some quick notes on interface ideas for digital libraries and/or knowledge graphs.</p>
<p>Recently there’s been something of an explosion in bibliographic tools to explore the literature. Examples include:</p>
<ul>
<li><a href="https://elicit.org">Elicit</a> which uses AI to search for and summarise papers</li>
<li><a href="https://scite.ai">scite</a> which uses AI to do sentiment analysis on citations (does paper A cite paper B favourably or not?)</li>
<li><a href="https://www.researchrabbit.ai">ResearchRabbit</a> which uses lists, networks, and timelines to discover related research</li>
<li><a href="https://typeset.io">Scispace</a> which navigates connections between papers, authors, topics, etc., and provides AI summaries.</li>
</ul>
<p>As an aside, I think these (and similar tools) are a great example of how bibliographic data such as abstracts, the citation graph, and (to a lesser extent) full text have become commodities. That is, what was once proprietary information is now free to anyone, which in turn means a whole ecosystem of new tools can emerge. If I were clever I’d be building a <a href="https://en.wikipedia.org/wiki/Wardley_map">Wardley map</a> to explore this. Note that a decade or so ago reference managers like <a href="https://www.zotero.org">Zotero</a> were made possible by publishers exposing basic bibliographic data on their articles. As we move to <a href="https://i4oc.org">open citations</a> we are seeing the next generation of tools.</p>
<p>Back to my main topic. As usual, rather than focus on what these tools do, I’m more interested in how they <strong>look</strong>. I have history here: when the iPad came out I was intrigued by the possibilities it offered for displaying academic articles, as discussed <a href="https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad.html">here</a>, <a href="https://iphylo.blogspot.com/2010/09/viewing-scientific-articles-on-ipad.html">here</a>, <a href="https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_24.html">here</a>, <a href="https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_3052.html">here</a>, and <a href="https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_31.html">here</a>. ResearchRabbit looks like this:</p>
<div style="padding:86.91% 0 0 0;position:relative;"><iframe src="https://player.vimeo.com/video/820871442?h=23b05b0dae&badge=0&autopause=0&player_id=0&app_id=58479" frameborder="0" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen style="position:absolute;top:0;left:0;width:100%;height:100%;" title="ResearchRabbit"></iframe></div><script src="https://player.vimeo.com/api/player.js"></script>
<p>Scispace’s <a href="https://typeset.io/explore/journals/parassitologia-1ieodjwe">“trace” view</a> looks like this:</p>
<div style="padding:84.55% 0 0 0;position:relative;"><iframe src="https://player.vimeo.com/video/820871348?h=2db7b661ef&badge=0&autopause=0&player_id=0&app_id=58479" frameborder="0" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen style="position:absolute;top:0;left:0;width:100%;height:100%;" title="Scispace screencast"></iframe></div><script src="https://player.vimeo.com/api/player.js"></script>
<p>What is interesting about both is that they display content from left to right in vertical columns, rather than the more common horizontal rows. This sort of display is sometimes called <a href="https://en.wikipedia.org/wiki/Miller_columns">Miller columns</a> or a <a href="https://web.archive.org/web/20210726134921/http://designinginterfaces.com/firstedition/index.php?page=Cascading_Lists">cascading list</a>.</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBPnV9fRBcvm-BX5PjzfG5Cff9PerCLsTW8d5ZbsL6b41t7ypD7ovmcgfTf3b4b34mbq8NM4sfwOHkgEq32FLYnD497RFQD4HQmYmh5Eveu1zWdDVyKyDtPyE98QoxTaOEnLA5kK0fnl3dOOEgUvtVKlTZ8bt1gj2v_8tDRWl9f50ybyei3A/s1024/GNUstep-liveCD.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="400" data-original-height="768" data-original-width="1024" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBPnV9fRBcvm-BX5PjzfG5Cff9PerCLsTW8d5ZbsL6b41t7ypD7ovmcgfTf3b4b34mbq8NM4sfwOHkgEq32FLYnD497RFQD4HQmYmh5Eveu1zWdDVyKyDtPyE98QoxTaOEnLA5kK0fnl3dOOEgUvtVKlTZ8bt1gj2v_8tDRWl9f50ybyei3A/s400/GNUstep-liveCD.png"/></a></div>
<p>By Gürkan Sengün (talk) - Own work, Public Domain, <a href="https://commons.wikimedia.org/w/index.php?curid=594715">https://commons.wikimedia.org/w/index.php?curid=594715</a></p>
<p>I’ve always found displaying a knowledge graph to be a challenge, as discussed <a href="https://iphylo.blogspot.com/2019/07/notes-on-collections-knowledge-graphs.html">elsewhere on this blog</a> and in my paper on <a href="https://peerj.com/articles/6739/#p-29">Ozymandias</a>. Miller columns enable one to drill down in increasing depth, but it doesn’t need to be a tree: it can be a path within a network. What I like about ResearchRabbit and the original Scispace interface is that they present the current item together with a list of possible connections (e.g., authors, citations) that you can drill down on. Clicking on these will result in a new column being appended to the right, with a view (typically a list) of the next candidates to visit. In graph terms, these are adjacent nodes to the original item. The clickable badges on each item can be thought of as sets of edges that have the same label (e.g., “authored by”, “cites”, “funded”, “is about”, etc.). Each of these nodes itself becomes a starting point for further exploration. Note that the original starting point isn’t privileged, other than being the starting point: each time we drill down we are seeing the same type of information displayed in the same way.</p>
<p>Note also that the navigation can be thought of as a <strong>card</strong> for a node, with <strong>buttons</strong> grouping the adjacent nodes. When we click on an individual button, it expands into a <strong>list</strong> in the next column. This can be thought of as a preview for each adjacent node. Clicking on an element in the list generates a new card (we are viewing a single node) and we get another set of buttons corresponding to the adjacent nodes.</p>
<p>One important behaviour in a Miller column interface is that the current path can be pruned at any point. If we go back (i.e., scroll to the left) and click on another tab on an item, everything downstream of that item (i.e., to the right) gets deleted and replaced by a new set of nodes. This could make retrieving a particular history of browsing a bit tricky, but encourages exploration. Both Scispace and ResearchRabbit have the ability to add items to a collection, so you can keep track of things you discover.</p>
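<p>The pruning behaviour is simple to model. Below is a minimal sketch in Python (with invented node identifiers; no actual ResearchRabbit or Scispace code is implied) of a Miller-column path through a graph, where selecting a node in an earlier column discards everything downstream:</p>

```python
class MillerPath:
    """A navigation path through a graph, rendered as Miller columns.

    Each element is a node identifier. Selecting a node at column
    `position` prunes everything to the right of that column, then
    appends the newly chosen adjacent node as a fresh column.
    """

    def __init__(self, start):
        self.columns = [start]

    def select(self, position, node):
        # Prune downstream columns, then append the chosen node.
        self.columns = self.columns[: position + 1]
        self.columns.append(node)
        return self.columns


path = MillerPath("paper:123")
path.select(0, "author:page")      # drill down on an author
path.select(1, "paper:456")        # then one of their papers
path.select(0, "cites:paper:789")  # go back and branch: right side is pruned
print(path.columns)                # ['paper:123', 'cites:paper:789']
```

Keeping the path as a flat list is what makes the interface cheap to implement, at the cost of discarding the pruned branches of the browsing history.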
<p>Lots of food for thought; I’m assuming that there is some user interface/experience research on Miller columns. One thing to remember is that Miller columns are most often associated with trees, but in this case we are exploring a network. That means that potentially there is no limit to the number of columns being generated as we wander through the graph. It will be interesting to think about what the average depth is likely to be; in other words, how far down the rabbit hole will we go?</p>
<h3>Update</h3>
<p>Should add link to David Regev's explorations of <a href="https://medium.com/david-regev-on-ux/flow-browser-b730daf0f717">Flow Browser</a>.</p>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>
Roderic Pagehttp://www.blogger.com/profile/00269598293846172649noreply@blogger.comtag:blogger.com,1999:blog-16081779.post-46921979595691182532023-04-03T16:30:00.003+01:002023-04-03T16:32:04.493+01:00ChatGPT, semantic search, and knowledge graphs<p>One thing about ChatGPT is it has opened my eyes to some concepts I was dimly aware of but am only now beginning to fully appreciate. ChatGPT enables you to ask it questions, but the answers depend on what ChatGPT “knows”. As several people have noted, what would be even better is to be able to run ChatGPT on your own content. Indeed, ChatGPT itself now supports this using <a href="https://openai.com/blog/chatgpt-plugins">plugins</a>.</p>
<h4 id="paul-graham-gpt">Paul Graham GPT</h4>
<p>However, it’s still useful to see how to add ChatGPT functionality to your own content from scratch. A nice example of this is <a href="https://paul-graham-gpt.vercel.app/">Paul Graham GPT</a> by <a href="https://twitter.com/mckaywrigley">Mckay Wrigley</a>, who took essays by Paul Graham (a well-known venture capitalist) and built a question and answer tool very like ChatGPT.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/ii1jcLg-eIQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<p>Because you can send a block of text to ChatGPT (as part of the prompt) you can get ChatGPT to summarise or transform that information, or answer questions based on that information. But there is a limit to how much information you can pack into a prompt. You can’t put all of Paul Graham’s essays into a prompt, for example. So a solution is to do some preprocessing. For example, given a question such as “How do I start a startup?” we could first find the essays that are most relevant to this question, then use them to create a prompt for ChatGPT. A quick and dirty way to do this is simply to do a text search over the essays and take the top hits. But we aren’t searching for words; we are searching for answers to a question. The essay with the best answer might not include the phrase “How do I start a startup?”.</p>
<h4 id="semantic-search">Semantic search</h4>
<p>Enter <a href="https://en.wikipedia.org/wiki/Semantic_search">Semantic search</a>. The key concept behind semantic search is that we are looking for documents with similar meaning, not just similarity of text. One approach to this is to represent documents by “embeddings”, that is, a vector of numbers that encapsulate features of the document. Documents with similar vectors are potentially related. In semantic search we take the query (e.g., “How do I start a startup?”), compute its embedding, then search among the documents for those with similar embeddings.</p>
<p>To create Paul Graham GPT, Mckay Wrigley did the following. First he sent each essay to the OpenAI API underlying ChatGPT, and in return he got the embedding for that essay (a vector of 1536 numbers). Each embedding was stored in a database (Mckay uses Postgres with <a href="https://github.com/pgvector/pgvector">pgvector</a>). When a user enters a query such as “How do I start a startup?”, that query is also sent to the OpenAI API to retrieve its embedding vector. Then we query the database of embeddings for Paul Graham’s essays and take the top five hits. These hits are, one hopes, the most likely to contain relevant answers. The original question and the most similar essays are then bundled up and sent to ChatGPT, which then synthesises an answer. See his <a href="https://github.com/mckaywrigley/paul-graham-gpt">GitHub repo</a> for more details. Note that we are still using ChatGPT, but on a set of documents it doesn’t already have.</p>
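<p>Stripped of the infrastructure, the retrieval step is just nearest-neighbour search over vectors. Here is a toy sketch: the three-dimensional “embeddings” and their values are made up for illustration (real OpenAI embeddings have 1536 dimensions), but the ranking logic is the same:</p>

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for the 1536-dimensional
# embeddings the OpenAI API returns; the numbers are invented.
essays = {
    "How to Start a Startup": [0.9, 0.1, 0.2],
    "Hackers and Painters":   [0.2, 0.8, 0.3],
    "Maker's Schedule":       [0.1, 0.3, 0.9],
}
query = [0.85, 0.15, 0.25]  # pretend embedding of "How do I start a startup?"

# Rank essays by similarity; the top hits would be bundled into the prompt.
ranked = sorted(essays, key=lambda t: cosine(query, essays[t]), reverse=True)
print(ranked[0])  # How to Start a Startup
```

In production this sort over every document is replaced by an index (pgvector gives you that inside Postgres), but the principle is identical.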
<h4 id="knowledge-graphs">Knowledge graphs</h4>
<p>I’m a fan of knowledge graphs, but they are not terribly easy to use. For example, I built a knowledge graph of Australian animals <a href="https://ozymandias-demo.herokuapp.com">Ozymandias</a> that contains a wealth of information on taxa, publications, and people, wrapped up in a web site. If you want to learn more you need to figure out how to write queries in SPARQL, which is not fun. Maybe we could use ChatGPT to write the SPARQL queries for us, but it would be much more fun to be simply ask natural language queries (e.g., “who are the experts on Australian ants?”). I made some naïve notes on these ideas <a href="https://iphylo.blogspot.com/2015/09/possible-project-natural-language.html">Possible project: natural language queries, or answering “how many species are there?”</a> and <a href="https://iphylo.blogspot.com/2019/05/ozymandias-meets-wikipedia-with-notes.html">Ozymandias meets Wikipedia, with notes on natural language generation</a>.</p>
<p>Of course, this is a well-known problem. Tools such as <a href="http://rdf2vec.org">RDF2vec</a> can take RDF from a knowledge graph and create embeddings which could in turn be used to support semantic search. But it seems to me that we could simplify this process a bit by making use of ChatGPT.</p>
<p>Firstly we would generate natural language statements from the knowledge graph (e.g., “species x belongs to genus y and was described in z”, “this paper on ants was authored by x”, etc.) that cover the basic questions we expect people to ask. We then get embeddings for these (e.g., using OpenAI). We then have an interface where people can ask a question (“is species x a valid species?”, “who has published on ants”, etc.); we get the embedding for that question, retrieve the natural language statements that are closest in embedding “space”, package everything up, and ask ChatGPT to summarise the answer.</p>
<p>The trick, of course, is to figure out how to generate natural language statements from the knowledge graph (which amounts to deciding what paths to traverse in the knowledge graph, and how to render those paths as something approximating English). We also want to know something about the sorts of questions people are likely to ask so that we have a reasonable chance of having the answers (for example, are people going to ask about individual species, or questions about summary statistics such as numbers of species in a genus, etc.).</p>
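<p>The first step, verbalising the graph, can be as simple as a template per edge label. Here is a sketch using invented predicate names and a tiny set of triples, not the actual Ozymandias vocabulary:</p>

```python
# Hypothetical triples; the predicate names are invented for
# illustration, not drawn from any real knowledge graph schema.
triples = [
    ("Atta colombica", "inGenus", "Atta"),
    ("Atta colombica", "describedIn", "Guérin-Méneville (1844)"),
    ("Guérin-Méneville (1844)", "authoredBy", "F. E. Guérin-Méneville"),
]

# One English template per edge label.
templates = {
    "inGenus":     "The species {s} belongs to the genus {o}.",
    "describedIn": "The species {s} was described in {o}.",
    "authoredBy":  "The publication {s} was authored by {o}.",
}

sentences = [templates[p].format(s=s, o=o) for s, p, o in triples]
print(sentences[0])  # The species Atta colombica belongs to the genus Atta.
```

Each generated sentence would then be embedded and indexed, so that a question like “what genus is Atta colombica in?” retrieves the relevant statements even though it shares few words with them.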
<p>What makes this attractive is that it seems a straightforward way to go from a largely academic exercise (build a knowledge graph) to something potentially useful (a question and answer machine). Imagine if something like the defunct BBC wildlife site (see <a href="https://iphylo.blogspot.com/2017/12/blue-planet-ii-bbc-and-semantic-web.html">Blue Planet II, the BBC, and the Semantic Web: a tale of lessons forgotten and opportunities lost</a>) revived <a href="https://aspiring-look.glitch.me">here</a> had a question and answer interface where we could ask questions rather than passively browse.</p>
<h4 id="summary">Summary</h4>
<p>I have so much more to learn, and need to think about ways to incorporate semantic search and ChatGPT-like tools into knowledge graphs.</p>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>
Roderic Pagehttp://www.blogger.com/profile/00269598293846172649noreply@blogger.comtag:blogger.com,1999:blog-16081779.post-63154373806362458242023-04-03T13:52:00.003+01:002023-04-03T13:57:01.158+01:00ChatGPT, of course<p>I haven’t blogged for a while; work and <a href="https://iphylo.blogspot.com/2023/03/dugald-stuart-page-1936-2022.html">other reasons</a> have meant I’ve not had much time to think, and mostly I blog to help me think.</p>
<p>ChatGPT is obviously a big thing at the moment, and once we get past the moral panic (“students can pass exams using AI!”) there are a lot of interesting possibilities to explore. Inspired by essays such as <a href="https://medium.com/@aarontay/how-q-a-systems-based-on-large-language-models-eg-gpt4-will-change-things-if-they-become-the-norm-c7cf62736ba">How Q&A systems based on large language models (eg GPT4) will change things if they become the dominant search paradigm — 9 implications for libraries</a> and <a href="https://about.sourcegraph.com/blog/cheating-is-all-you-need">Cheating is All You Need</a>, as well as <a href="https://paul-graham-gpt.vercel.app/">Paul Graham GPT</a>, I thought I’d try a few things and see where this goes.</p>
<p>ChatGPT can do some surprising things.</p>
<h4 id="parse-bibliographic-data">Parse bibliographic data</h4>
<p>I spend a LOT of time working with bibliographic data, trying to parse it into structured data. ChatGPT can do this:</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2bA-Xk6D2wfexP0cs-U3AQa534hNW7J06I0avM82WD3dZSiRvXyHFScv3Hb8QfKKp1fk4GU4olhv1BOq8Pqu4IJgptlBap3xcLI1bFJwB3EOmHKxqf4iy2exJy4vwZ2n4I0U0JVui4Xmhvdy3mTbMSXCRbmCglNFULi5oQEhtPVG4gLJjdw/s924/Screenshot%202023-04-03%20at%2012.59.30.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" height="400" data-original-height="924" data-original-width="738" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2bA-Xk6D2wfexP0cs-U3AQa534hNW7J06I0avM82WD3dZSiRvXyHFScv3Hb8QfKKp1fk4GU4olhv1BOq8Pqu4IJgptlBap3xcLI1bFJwB3EOmHKxqf4iy2exJy4vwZ2n4I0U0JVui4Xmhvdy3mTbMSXCRbmCglNFULi5oQEhtPVG4gLJjdw/s400/Screenshot%202023-04-03%20at%2012.59.30.png"/></a></div>
<p>Note that it does more than simply parse the strings, it expands journal abbreviations such as “J. Malay Brch. R. Asiat. Soc.” to the full name “Journal of the Malayan Branch of the Royal Asiatic Society”. So we can get clean, parsed data in a range of formats.</p>
<h4 id="parse-specimens">Parse specimens</h4>
<p>Based on the success with parsing bibliographic strings I wondered how well it could handle specimen citations (“material examined”). Elsewhere I’ve been critical of Plazi’s ability to do this; see <a href="https://iphylo.blogspot.com/2021/10/problems-with-plazi-parsing-how.html">Problems with Plazi parsing: how reliable are automated methods for extracting specimens from the literature?</a>.</p>
<p>For example, given this specimen record on p. 130 of <a href="https://doi.org/10.5852/ejt.2021.775.1553">doi:10.5852/ejt.2021.775.1553</a></p>
<blockquote>
<p>LAOS • Kammoune Province, Bunghona Market, 7 km N of Xe Bangfai River;<br>
17.13674° N, 104.98591° E; E. Jeratthitikul, K. Wisittikoson, A. Fanka, N. Wutthituntisil and P. Prasankok leg.; sold by local people;<br>
MUMNH-UNI2831.</p>
</blockquote>
<p>ChatGPT extracted a plausible Darwin Core record:</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIwVYuPXl47t1uSpJ3xpbsIT60UUWeKTKSBpIQgD8L2gkj_fuNf4sZKR76JgsfVYYHPs9G9VoM-caUhG1P3irMyB6erSYwYI5LDlLkxdrAFWdbY3ExDs34nMVOF0tq-NBtsDKMZsSJWjhJaGt9v3dtAqWvRyVMNZ46eG2sRhJTICTkFghKtw/s901/Screenshot%202023-04-03%20at%2013.30.54.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" height="400" data-original-height="901" data-original-width="764" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIwVYuPXl47t1uSpJ3xpbsIT60UUWeKTKSBpIQgD8L2gkj_fuNf4sZKR76JgsfVYYHPs9G9VoM-caUhG1P3irMyB6erSYwYI5LDlLkxdrAFWdbY3ExDs34nMVOF0tq-NBtsDKMZsSJWjhJaGt9v3dtAqWvRyVMNZ46eG2sRhJTICTkFghKtw/s400/Screenshot%202023-04-03%20at%2013.30.54.png"/></a></div>
<p>I’ve been experimenting with parsing specimen records using the same machine learning approach for bibliographic data (e.g., <a href="https://iphylo.blogspot.com/2021/07/citation-parsing-released.html">Citation parsing tool released</a>), perhaps it’s time to rethink that idea.</p>
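<p>For comparison, here is roughly what a hand-rolled, rule-based parser for the record above might look like. The regular expressions are my own guesses (not Plazi’s or anyone else’s actual rules, and the collector list is abridged); they illustrate how brittle this approach is next to simply asking a language model:</p>

```python
import re

# The LAOS record from above, lightly abridged for the example.
record = ("LAOS • Kammoune Province, Bunghona Market, 7 km N of Xe Bangfai "
          "River; 17.13674° N, 104.98591° E; E. Jeratthitikul et al. leg.; "
          "sold by local people; MUMNH-UNI2831.")

dwc = {}

# Decimal coordinates with hemisphere letters.
m = re.search(r"(\d+\.\d+)°\s*([NS]),\s*(\d+\.\d+)°\s*([EW])", record)
if m:
    lat, ns, lon, ew = m.groups()
    dwc["decimalLatitude"] = float(lat) * (1 if ns == "N" else -1)
    dwc["decimalLongitude"] = float(lon) * (1 if ew == "E" else -1)

# Catalogue number: an institution code, hyphen, then an identifier.
m = re.search(r"\b([A-Z]{3,}-[A-Z0-9]+)\.?$", record)
if m:
    dwc["catalogNumber"] = m.group(1)

# Country precedes the first "•" separator in this citation style.
dwc["country"] = record.split(" • ")[0].title()

print(dwc)
```

Every one of those patterns breaks on a journal that formats coordinates, catalogue numbers, or separators differently, which is exactly the maintenance burden a language model promises to remove.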
<h4 id="correct-ocr-errors">Correct OCR errors</h4>
<p>OCR errors are a well known limitation of the Biodiversity Heritage Library (BHL), see for example <a href="https://blog.biodiversitylibrary.org/2022/07/ocr-improvements-early-analysis.html">OCR Improvements: An Early Analysis</a>. Given that ChatGPT is built on top of a large language model it seems obvious to see how it handles OCR errors. I tried the following snippet of text from <a href="https://www.biodiversitylibrary.org/page/4312227">BHL page 4312227</a>:</p>
<blockquote>
<p>i BREVIORA No. 185</p>
<p>Jiiaii Jr., 16, who always join him and Hedriek in th(> search for<br>
frogs. Iloraeio Mayorga, research assistant to tlie author, has also<br>
been (wtreniely useful, not only in field work, but also in the<br>
tedious curatorial and laboratory work that usually fol- lows. Dr.<br>
John Randall kindly took tlie photouraplis of E. hedricki, and the<br>
Galiiiaues family of San Juan made their El Verde house available as<br>
a (MMiter of activities for the colleetiuff party. To all these<br>
jx’opic, the autlioi- feels (iee])ly indebted.</p>
<p>Elet’tiierodactylus iiKnRrcKi s]). n.</p>
<p>Type. Museum of Coniparative Zoology Xo. ;}()!)()8, c^ from El Verde,<br>
west flank of El Yuiuinc Puerto Hico. I.IOO ft. Coll. J. A. Kivero,<br>
11 Aug-. I!)(i2.</p>
</blockquote>
<p>ChatGPT returns this:</p>
<blockquote>
<p>i BREVIORA No. 185</p>
<p>Juan Jr., 16, who always joins him and Hedrick in the search for<br>
frogs. Horacio Mayorga, research assistant to the author, has also<br>
been extremely useful, not only in field work, but also in the tedious<br>
curatorial and laboratory work that usually follows. Dr. John Randall<br>
kindly took the photographs of E. hedricki, and the Galinanes family<br>
of San Juan made their El Verde house available as a center of<br>
activities for the collecting party. To all these people, the author<br>
feels deeply indebted.</p>
<p>Eleutherodactylus hedricki sp. n.</p>
<p>Type. Museum of Comparative Zoology No. 30908, c? from El Verde, west<br>
flank of El Yunque, Puerto Rico. 1100 ft. Coll. J. A. Rivero, 11 Aug.<br>
1962.</p>
</blockquote>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibZLxXSTdt64SMhfqS8NCcLOxbVaFmkhMQx0yEUWcm696EL6k2_VgxZFEDT-SEqOChmwDggoQLkuU6DpcOF_d-nkQRKzQLt65WKfkFlIzDRV9LVrDqmu-D0OZl1p5Vva4QUdvDb3JpeKPPDXZ-BcZqq8V4PyGepxlnwtt7-q4Sg6_aHD58gw/s859/Screenshot%202023-04-03%20at%2013.07.53.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="400" data-original-height="510" data-original-width="859" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibZLxXSTdt64SMhfqS8NCcLOxbVaFmkhMQx0yEUWcm696EL6k2_VgxZFEDT-SEqOChmwDggoQLkuU6DpcOF_d-nkQRKzQLt65WKfkFlIzDRV9LVrDqmu-D0OZl1p5Vva4QUdvDb3JpeKPPDXZ-BcZqq8V4PyGepxlnwtt7-q4Sg6_aHD58gw/s400/Screenshot%202023-04-03%20at%2013.07.53.png"/></a></div>
<p>Comparing this to the scanned image, ChatGPT does pretty well; for example the gobbledegook “Elet’tiierodactylus iiKnRrcKi” is correctly translated as “Eleutherodactylus hedricki”. Running all of BHL through ChatGPT probably isn’t feasible, but one could imagine targeted cleaning of key papers.</p>
<h4 id="summary">Summary</h4>
<p>These small experiments are fairly trivial, but they are the sort of tedious tasks that would otherwise require significant programming (or other resources) to solve. But ChatGPT can do rather more, as I hope to discuss in the next post.</p>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>
Roderic Pagehttp://www.blogger.com/profile/00269598293846172649noreply@blogger.comtag:blogger.com,1999:blog-16081779.post-3855321436909238732023-03-14T03:00:00.002+00:002023-03-22T07:25:56.491+00:00Dugald Stuart Page 1936-2022<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZweukxntl7R5jnk3knVFVrqZ5RxC7mPZBV4gKeDIglbFzs2O442nbxqs8t8jV2tLqCU24K6gS32jW-Pe8q3O_5JR1Ms3qW1aQAZ877cKkFfcUydqUba9HsgNlX-zS9Ne92eLxRGS8F-lStTecJw2oalp3u58Yoc0oM7CUin5LKPeFIJ7Rzg/s3454/_DSC5106.jpg" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="400" data-original-height="2582" data-original-width="3454" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZweukxntl7R5jnk3knVFVrqZ5RxC7mPZBV4gKeDIglbFzs2O442nbxqs8t8jV2tLqCU24K6gS32jW-Pe8q3O_5JR1Ms3qW1aQAZ877cKkFfcUydqUba9HsgNlX-zS9Ne92eLxRGS8F-lStTecJw2oalp3u58Yoc0oM7CUin5LKPeFIJ7Rzg/s400/_DSC5106.jpg"/></a></div>
My dad died last weekend. Below is a notice in today's New Zealand Herald. I'm in New Zealand for his funeral. Don't really have the words for this right now.
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRUTOFF1VWHCl8dg3FQuaWy5LM7aX8IivdRpTtzgrdQTEymsA5bLTZE3cSQf1WQIP3XrC46JsLScP8BxTK9C5a-B1i51yg8WGSJD0heJVaoDLnerv0lD1o3qloDjqEuuyfX4wagHB5YYBmjWnGeVQvyYVngvDDf9eM6pmMtZ7x94Y4jSVrug/s3640/IMG_2870.jpeg" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" height="320" data-original-height="3640" data-original-width="1391" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRUTOFF1VWHCl8dg3FQuaWy5LM7aX8IivdRpTtzgrdQTEymsA5bLTZE3cSQf1WQIP3XrC46JsLScP8BxTK9C5a-B1i51yg8WGSJD0heJVaoDLnerv0lD1o3qloDjqEuuyfX4wagHB5YYBmjWnGeVQvyYVngvDDf9eM6pmMtZ7x94Y4jSVrug/s320/IMG_2870.jpeg"/></a></div>
Roderic Pagehttp://www.blogger.com/profile/00269598293846172649noreply@blogger.comtag:blogger.com,1999:blog-16081779.post-22163554264553994352022-12-16T17:54:00.006+00:002022-12-17T08:12:23.397+00:00David Remsen<p>I heard yesterday from Martin Kalfatovic (BHL) that David Remsen has died. Very sad news. It's starting to feel like iPhylo might end up being a list of obituaries of people working on biodiversity informatics (e.g., <a href="https://iphylo.blogspot.com/2016/05/scott-federhen-rip.html">Scott Federhen</a>).</p>
<p>I spent several happy visits at MBL at Woods Hole talking to Dave at the height of the uBio project, which really kickstarted large scale indexing of taxonomic names, and the use of taxonomic name finding tools to index the literature. His work on uBio with David ("Paddy") Patterson led to the <a href="https://eol.org">Encyclopedia of Life</a> (EOL).</p>
<p>A number of the things I'm currently working on are things Dave started. For example, I recently uploaded a version of his dataset for Nomenclator Zoologicus[1] to <a href="https://www.checklistbank.org/dataset/126539/about">ChecklistBank</a> where I'm working on augmenting that original dataset by adding links to the taxonomic literature. My <a href="https://biorss.herokuapp.com/?feed=Y291bnRyeT1XT1JMRCZwYXRoPSU1QiUyMkJJT1RBJTIyJTVE">BioRSS project</a> is essentially an attempt to revive uBioRSS[2] (see <a href="https://iphylo.blogspot.com/2021/11/revisiting-rss-to-monitor-latests.html">Revisiting RSS to monitor the latest taxonomic research</a>).</p>
<p>I have fond memories of those visits to Woods Hole. A very sad day indeed.</p>
<p><b>Update:</b> The David Remsen Memorial Fund has been set up on <a href="https://www.gofundme.com/f/david-remsen-memorial-fund">GoFundMe</a>.</p>
<p>1. Remsen, D. P., Norton, C., & Patterson, D. J. (2006). Taxonomic Informatics Tools for the Electronic Nomenclator Zoologicus. The Biological Bulletin, 210(1), 18–24. https://doi.org/10.2307/4134533</p>
<p>2. Leary, P. R., Remsen, D. P., Norton, C. N., Patterson, D. J., & Sarkar, I. N. (2007). uBioRSS: Tracking taxonomic literature using RSS. Bioinformatics, 23(11), 1434–1436. https://doi.org/10.1093/bioinformatics/btm109</p>
Roderic Pagehttp://www.blogger.com/profile/00269598293846172649noreply@blogger.comtag:blogger.com,1999:blog-16081779.post-79067736537060436832022-09-29T15:00:00.002+01:002022-09-29T15:00:01.682+01:00The ideal taxonomic journal<p>These are just some random notes on an “ideal” taxonomic journal, inspired in part by some recent discussions on “turbo-taxonomy” (e.g., <a href="https://doi.org/10.3897/zookeys.1087.76720">https://doi.org/10.3897/zookeys.1087.76720</a> and <a href="https://doi.org/10.1186/1742-9994-10-15">https://doi.org/10.1186/1742-9994-10-15</a>), and also examples such as the Australian Journal of Taxonomy <a href="https://doi.org/10.54102/ajt.qxi3r">https://doi.org/10.54102/ajt.qxi3r</a> which seems well-intentioned but limited.</p>
<h2 id="xml">XML</h2>
<p>One approach is to have highly structured text that embeds detailed markup, and ideally a tool that generates markup in XML. This is the approach taken by Pensoft. There is an inevitable trade-off between the burden on authors of marking up text and the benefit of making the paper machine readable. In some ways this seems misplaced effort given that there is little evidence that publications by themselves have much value (see <a href="https://iphylo.blogspot.com/2021/12/the-business-of-extracting-knowledge.html">The Business of Extracting Knowledge from Academic Publications</a>). “Value” in this case means as a source of data or factual statements that we can compute over. Human-readable text is not a good way to convey this sort of information.</p>
<p>It’s also interesting that many editing tools are going in the opposite direction, for example there are minimalist tools using <a href="https://en.wikipedia.org/wiki/Markdown">Markdown</a> where the goal is to <em>get out of the author’s way</em>, rather than impose a way of writing. Text is written by humans for humans, so the tools should be human-friendly.</p>
<p>The idea of publishing using XML is attractive in that it gives you XML that can be archived by, say, PubMed Central, but other than that the value seems limited. A cursory glance at download stats for journals that provide both PDF and XML downloads, such as <em>PLoS One</em> and <em>ZooKeys</em>, shows that PDF is by far the more popular format. So arguably there is little value in providing XML. Those who have tried to use JATS-XML as an authoring tool have not had a happy time: <a href="https://doi.org/10.7557/15.5517">How we tried to JATS XML</a>. However, there are various tools to help with the process, such as <a href="https://github.com/Vitaliy-1/docxToJats">docxToJats</a>, Texture, and <a href="https://github.com/elifesciences/jats-xml-to-pdf">jats-xml-to-pdf</a>, if this is the route one wants to take.</p>
<h2 id="automating-writing-manuscripts">Automating writing manuscripts</h2>
<p>The dream, of course, is to have a tool where you store all your taxonomic data (literature, specimens, characters, images, sequences, media files, etc.) and at the click of a button generate a paper. Certainly some of this can be automated, much nomenclatural and specimen information could be converted to human-readable text. Ideally this computer-generated text would not be edited (otherwise it could get out of sync with the underlying data). The text should be <a href="https://en.wikipedia.org/wiki/Transclusion">transcluded</a>. As an aside, one way to do this would be to include things such as lists of material examined as images rather than text while the manuscript is being edited. In the same way that you (probably) wouldn’t edit a photograph within your text editor, you shouldn’t be editing data. When the manuscript is published the data-generated portions can then be output as text.</p>
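To make the idea of data-generated, transcluded text concrete, here is a minimal sketch: structured specimen records rendered as a "Material examined" paragraph at publication time rather than edited by hand. The record fields and the <code>material_examined</code> function are invented for illustration, not taken from any existing tool.

```python
# Sketch: "transcluded" data-generated text. Structured specimen records
# (field names here are hypothetical) are rendered into a human-readable
# "Material examined" paragraph, so the text never drifts from the data.

def material_examined(records):
    """Format specimen records as a citation-style text block."""
    parts = []
    for r in records:
        parts.append(", ".join([
            r["country"], r["locality"],
            f'{r["count"]} {r["sex"]}',
            r["date"], f'({r["code"]})',
        ]))
    return "Material examined: " + "; ".join(parts) + "."

records = [
    {"country": "Finland", "locality": "Helsinki", "count": 2,
     "sex": "males", "date": "12.vi.1998", "code": "MZH 12345"},
    {"country": "Finland", "locality": "Turku", "count": 1,
     "sex": "female", "date": "3.vii.2001", "code": "MZH 12346"},
]
print(material_examined(records))
```

Regenerating this paragraph from the database on every edit is what keeps the manuscript and the underlying data in sync.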
<p>Of course all of this assumes that we have taxonomic data in a database (or some other storage format, including plain text and Markdown, e.g. <a href="https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html">Obsidian, markdown, and taxonomic trees</a>) that can generate outputs in the various formats that we need.</p>
<h2 id="archiving-data-and-images">Archiving data and images</h2>
<p>One of the really nice things that <a href="http://plazi.org">Plazi</a> do is have a pipeline that sends taxonomic descriptions and images to Zenodo, and similar data to GBIF. Any taxonomic journal should be able to do this. Indeed, arguably each taxonomic treatment within the paper should be linked to the Zenodo DOI at the time of publication. We could even imagine ultimately having treatments as transclusions within the larger manuscript. Alternatively we could store the treatments as parts of the larger article (rather like chapters in a book), each with a CrossRef DOI. I’m still sceptical about whether these treatments are as important as we make out, see <a href="https://iphylo.blogspot.com/2022/09/does-anyone-cite-taxonomic-treatments.html">Does anyone cite taxonomic treatments?</a>. But having machine-readable taxonomic data archived and accessible is a good thing. Uploading the same data to GBIF makes much of that data immediately accessible. Now that GBIF offers <a href="https://www.gbif.org/composition/3kQFinjwHbCGZeLb5OhwN2/gbif-hosted-portals">hosted portals</a> there is the possibility of having custom interfaces to data from a particular journal.</p>
<h2 id="name-and-identifier-registration">Name and identifier registration</h2>
<p>We would also want automatic registration of new taxonomic names, for which there are pipelines (see “A common registration-to-publication automated pipeline for nomenclatural acts for higher plants (International Plant Names Index, IPNI), fungi (Index Fungorum, MycoBank) and animals (ZooBank)” <a href="https://doi.org/10.3897/zookeys.550.9551">https://doi.org/10.3897/zookeys.550.9551</a>). These pipelines do not seem to be documented in much detail, and the data formats differ across registration agencies (e.g., IPNI and ZooBank). For example, ZooBank seems to require TaxPub XML.</p>
<p>Registration of names and identifiers, especially across multiple registration agencies (ZooBank, CrossRef, DataCite, etc.) requires some coordination, especially when one registration agency requires identifiers from another.</p>
<h2 id="summary">Summary</h2>
<p>If data is key, then the taxonomic paper itself becomes something of a wrapper around that data. It still serves the function of being human-readable, providing broader context for the work, and as an archive that conforms to currently accepted ways to publish taxonomic names. But in some ways it is the least interesting part of the process.</p>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>
Roderic Pagehttp://www.blogger.com/profile/00269598293846172649noreply@blogger.comtag:blogger.com,1999:blog-16081779.post-88726678098703065112022-09-14T11:12:00.002+01:002022-09-29T14:57:30.429+01:00DNA barcoding as intergenerational transfer of taxonomic knowledge<p>I <a href="https://twitter.com/rdmpage/status/1569738844416638981?s=21&t=9OVXuoUEwZtQt-Ldzlutfw">tweeted about this</a> but want to bookmark it for later as well. The paper “A molecular-based identification resource for the arthropods of Finland” <a href="https://doi.org/10.1111/1755-0998.13510">doi:10.1111/1755-0998.13510</a> contains the following:</p>
<blockquote>
<p>…the annotated barcode records assembled by FinBOL participants represent a tremendous <mark>intergenerational transfer of taxonomic knowledge</mark> … the time contributed by current taxonomists in identifying and contributing voucher specimens represents a great gift to future generations who will benefit from their expertise when they are no longer able to process new material.</p>
</blockquote>
<p>I think this is a very clever way to characterise the project. In an age of machine learning this may be the commonest way to share knowledge, namely as expert-labelled training data used to build tools for others. Of course, this means the expertise itself may be lost, which has implications for updating the models if the data isn’t complete. But it speaks to Charles Godfray’s theme of <a href="https://biostor.org/reference/250587">“Taxonomy as information science”</a>.</p>
<p>Note that the knowledge is also transformed in the sense that the underlying expertise of interpreting morphology, ecology, behaviour, genomics, and the past literature is not what is being passed on. Instead it is probabilities that a DNA sequence belongs to a particular taxon.</p>
<p>This feels different to, say, iNaturalist, where there is a machine learning model to identify images. In that case, the model is built on something the community itself has created, and continues to create. Yes, the underlying idea is the same: “experts” have labelled the data, a model is trained, the model is used. But the benefits of the <a href="https://www.inaturalist.org">iNaturalist</a> model are immediately applicable to the people whose data built the model. In the case of barcoding, because the technology itself is still not in the hands of many (relative to, say, digital imaging), the benefits are perhaps less tangible. Obviously researchers working with environmental DNA will find it very useful, but broader impact may await the arrival of citizen science DNA barcoding.</p>
<p>The other consideration is whether the barcoding helps taxonomists. Is it to be used to help prioritise future work (“we are getting lots of unknown sequences in these taxa, lets do some taxonomy there”), or is it simply capturing the knowledge of a generation that won’t be replaced:</p>
<blockquote>
<p>The need to capture such knowledge is essential because there are, for example, no young Finnish taxonomists who can critically identify species in many key groups of arthropods (e.g., aphids, chewing lice, chalcid wasps, gall midges, most mite lineages).</p>
</blockquote>
<p>The cycle of collect data, test and refine model, collect more data, rinse and repeat that happens with iNaturalist creates a feedback loop. It’s not clear that a similar cycle exists for DNA barcoding.</p>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>
Roderic Pagehttp://www.blogger.com/profile/00269598293846172649noreply@blogger.comtag:blogger.com,1999:blog-16081779.post-8533163575910282432022-09-08T17:09:00.006+01:002022-09-08T18:24:22.408+01:00Local global identifiers for decentralised wikis<p>I've been thinking a bit about how one could use a Markdown wiki-like tool such as Obsidian to work with taxonomic data (see earlier posts <a href="https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html">Obsidian, markdown, and taxonomic trees</a> and <a href="https://iphylo.blogspot.com/2020/08/personal-knowledge-graphs-obsidian-roam.html">Personal knowledge graphs: Obsidian, Roam, Wikidata, and Xanadu</a>).</p>
<p>One "gotcha" would be how to name pages. If we treat the database as entirely local, then the page names don't matter, but what if we envisage sharing the database, or merging it with others (for example, if we divided a taxon up into chunks, and different people worked on those different chunks)? </p>
<p>This is the attraction of globally unique identifiers. You and I can independently work on the same thing, such as data linked to a scientific paper, safe in the knowledge that if we both use the DOI for that paper we can easily combine what we've done. But global identifiers can also be a pain, especially if we need to use a service to look them up ("is there a DOI for this paper?", "what is the LSID for this taxonomic name?").</p>
<p>Life would be easier if we could generate identifiers "locally", but had some assurance that they would be globally unique, and that anyone else generating an identifier for the same thing would arrive at the same identifier (this eliminates things such as <a href="https://en.wikipedia.org/wiki/Universally_unique_identifier">UUIDs</a> which are intentionally designed to prevent people generating the same identifier). One approach is "content addressing" (see, e.g. <a href="https://web.archive.org/web/20210514054054/https://bentrask.com/notes/content-addressing.html">Principles of Content Addressing</a> - dead link but in the Wayback Machine, see also <a href="https://github.com/btrask/stronglink">btrask/stronglink</a>). For example, we can generate a cryptographic hash of a file (such as a PDF) and use that as the identifier.</p>
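A minimal sketch of content addressing using Python's hashlib (SHA-1 is chosen here purely for illustration, much as git uses it): the identifier is computed from the bytes themselves, so anyone hashing the same file gets the same identifier with no central registry involved.

```python
# Sketch: content addressing. Hashing the bytes of a file yields a
# globally unique identifier anyone can regenerate independently.
# SHA-1 here is illustrative; any cryptographic hash would do.
import hashlib

def content_id(data: bytes) -> str:
    """Identifier = hex digest of the content itself."""
    return hashlib.sha1(data).hexdigest()

pdf_bytes = b"%PDF-1.4 ..."  # stand-in for the bytes of a real PDF
print(content_id(pdf_bytes))  # same bytes anywhere -> same identifier
```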
<p>Now the problem is that we have globally unique, but ugly and unfriendly identifiers (such as "6c98136eba9084ea9a5fc0b7693fed8648014505"). What we need are nice, easy to use identifiers we can use as page names. <a href="https://species.wikimedia.org/wiki/Main_Page">Wikispecies</a> serves as a possible role model, where taxon names serve as page names, as do simplified citations (e.g., authors and years). This model runs into the problem that taxon names aren't unique, nor are author + year combinations. In Wikispecies this is resolved by having a centralised database where it's first come, first served. If there is a name clash you have to create a new name for your page. This works, but what if you have multiple databases run by different people? How do we ensure the identifiers are the same?</p>
<p>Then I remembered Roger Hyam's flight of fantasy over a decade ago: <a href="http://www.hyam.net/blog/archives/1007">SpeciesIndex.org – an impractical, practical solution</a>. He proposed the following rules to generate a unique URI for a taxonomic name:
<ul>
<li>The URI must start with "http://speciesindex.org" followed by one or more of the following separated by slashes.</li>
<li>First word of name. Must only contain letters. Must not be the same as one of the names of the nomenclatural codes (icbn or iczn). Optional but highly recommended.</li>
<li>Second word of name. Must only contain letters and not be a nomenclatural code name. Optional.</li>
<li>Third word of name. Must only contain letters and not be a nomenclatural code name. Optional.</li>
<li>Year of publication. Must be an integer greater than 1650 and equal to or less than the current year. If this is an ICZN name then this should be the year the species (epithet) was published as is commonly cited after the name. If this is an ICBN name at species or below then it is the date of the combination. Optional. Recommended for zoological names if known. Not recommended for botanical names unless there is a known problem with homonyms in use by non-taxonomists.</li>
<li>Nomenclatural code governing the name of the taxon. Currently this must be either 'icbn' or 'iczn'. This may be omitted if the code is unknown or not relevant. Other codes may be added to this list.</li>
<li>Qualifier. This must be a Version 4 RFC-4122 UUID. Optional. Used to generate a new independent identifier for a taxon for which the conventional name is unknown or does not exist or to indicate a particular taxon concept that bears the embedded name.</li>
<li>The whole speciesindex.org URI string should be considered case
sensitive. Everything should be lower case apart from the first letter of words that are specified as having upper case in their relevant codes e.g. names at and above the rank of genus.</li>
</ul>
</p>
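The rules above are mechanical enough to sketch in code. This is my own hypothetical helper covering only a subset of them (name parts, year, and code); speciesindex.org was a thought experiment, so nothing here resolves anywhere.

```python
# Sketch: assemble a speciesindex.org-style URI from name parts,
# following a subset of the rules listed above. Illustrative only.
from datetime import date

CODES = {"icbn", "iczn"}  # the two codes the proposal allowed

def species_index_uri(*words, year=None, code=None):
    parts = []
    for w in words:
        # name parts must be letters only, and not a code name
        if not w.isalpha() or w.lower() in CODES:
            raise ValueError(f"invalid name part: {w}")
        parts.append(w)
    if year is not None:
        # "greater than 1650 and equal to or less than the current year"
        if not 1650 < year <= date.today().year:
            raise ValueError(f"invalid year: {year}")
        parts.append(str(year))
    if code is not None:
        if code not in CODES:
            raise ValueError(f"unknown code: {code}")
        parts.append(code)
    return "http://speciesindex.org/" + "/".join(parts)

print(species_index_uri("Patu", "jidanweishi", year=2009, code="iczn"))
# → http://speciesindex.org/Patu/jidanweishi/2009/iczn
```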
<p>Roger is basically arguing that while names aren't unique (i.e., we have homonyms such as <i>Abronia</i>) they are pretty close to being so, and with a few tweaks we can come up with a unique representation. Another way to think about this: if we had a database of all taxonomic names, we could construct a <a href="https://en.wikipedia.org/wiki/Trie">trie</a> and for each name find the shortest set of name parts (genus, species, etc), year, and code that gave us a unique string for that name. In many cases the species name may be all we need, in other cases we may need to add year and/or nomenclatural code to arrive at a unique string.
</p>
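The trie idea can be sketched without building an explicit trie, by counting shared prefixes of each key directly. The <code>shortest_unique</code> helper is hypothetical, and the toy keys reuse the <i>Abronia</i> homonym (a lizard genus and a plant genus) mentioned above.

```python
# Sketch: find the shortest unique "name part" key for each taxon.
# This does what a trie lookup would do, by counting shared prefixes.
from collections import Counter

def shortest_unique(keys):
    """keys: tuples of name parts (words, year, code, ...). For each key,
    return the shortest leading run of parts shared with no other key."""
    prefix_counts = Counter(k[:i] for k in keys for i in range(1, len(k) + 1))
    result = {}
    for k in keys:
        for i in range(1, len(k) + 1):
            if prefix_counts[k[:i]] == 1:
                result[k] = "/".join(k[:i])
                break
        else:
            result[k] = "/".join(k)  # identical keys: fall back to full key
    return result

keys = [
    ("Abronia", "graminea", "iczn"),  # the lizard genus
    ("Abronia", "villosa", "icbn"),   # the plant genus (homonym)
    ("Patu", "jidanweishi", "iczn"),
]
print(shortest_unique(keys))
```

Here "Patu" alone is unique, while the two <i>Abronia</i> names each need their species epithet to disambiguate.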
<p>What about bibliographic references? Well, many of us will have databases (e.g., Endnote, Mendeley, Zotero, etc.) which generate "cite keys". These are typically short, memorable identifiers for a reference that are unique within that database. There is an interesting discussion on the <a href="https://discourse.jabref.org/t/universal-citekey-generator/2441/2">JabRef forum</a> regarding a "Universal Citekey Generator", and source code is available at <a href="https://github.com/cparnot/universal-citekey-js">cparnot/universal-citekey-js</a>. I've yet to explore this in detail, but it looks like a promising way to generate unique identifiers from basic metadata (echoes of more elaborate schemes such as <a href="https://en.wikipedia.org/wiki/Serial_Item_and_Contribution_Identifier">SICIs</a>). For example,
<blockquote>Senna AR, Guedes UN, Andrade LF, Pereira-Filho GH. 2021. A new species of amphipod Pariphinotus Kunkel, 1910 (Amphipoda: Phliantidae) from Southwestern Atlantic. Zool Stud 60:57. doi:10.6620/ZS.2021.60-57.</blockquote>
becomes "Senna:2021ck". So if two people have the same, core, metadata for a paper they can generate the same key.</p>
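The general idea can be sketched as follows. To be clear, this is a simplified stand-in of my own, not the actual universal-citekey algorithm: the key is derived deterministically from surname, year, and a couple of hash-derived letters from the title, so identical metadata always yields an identical key.

```python
# Sketch: a deterministic cite key from minimal metadata. NOT the
# algorithm used by universal-citekey-js; only the principle matters:
# same metadata in, same key out, with no central registry.
import hashlib

def cite_key(surname: str, year: int, title: str) -> str:
    digest = hashlib.sha1(title.strip().lower().encode("utf-8")).hexdigest()
    # two hash-derived letters disambiguate same-author, same-year papers
    suffix = "".join(chr(ord("a") + int(c, 16) % 26) for c in digest[:2])
    return f"{surname}:{year}{suffix}"

key = cite_key("Senna", 2021,
               "A new species of amphipod Pariphinotus Kunkel, 1910 "
               "(Amphipoda: Phliantidae) from Southwestern Atlantic")
print(key)  # "Senna:2021" plus two letters that depend on the title
```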
<p>Hence it seems that with a few conventions (and maybe some simple tools to support them) we could have decentralised wiki-like tools that used the same identifiers for the same things, and yet those identifiers would be short and human-friendly.</p>
Roderic Pagehttp://www.blogger.com/profile/00269598293846172649noreply@blogger.comtag:blogger.com,1999:blog-16081779.post-6076265289201848602022-09-01T17:49:00.001+01:002022-09-01T17:49:51.803+01:00Does anyone cite taxonomic treatments?<div class="separator" style="clear: both;"><a href="https://zenodo.org/record/5731100/thumb100" style="display: block; padding: 1em 0; text-align: center; clear: right; float: right;"><img alt="" border="0" height="128" data-original-height="106" data-original-width="100" src="https://zenodo.org/record/5731100/thumb250"/></a></div>
Taxonomic treatments have come up in various discussions I'm involved in, and I'm curious as to whether they are actually being used, in particular, whether they are being cited. Consider the following quote:
<blockquote>
The taxa are described in taxonomic treatments, well defined sections of scientific publications (Catapano 2019). They include a nomenclatural section and one or more sections including descriptions, material citations referring to studied specimens, or notes ecology and behavior. In case the treatment does not describe a new discovered taxon, previous treatments are cited in the form of treatment citations. This citation can refer to a previous treatment and add additional data, or it can be a statement synonymizing the taxon with another taxon. This allows building a citation network, and ultimately is a constituent part of the catalogue of life. - Taxonomic Treatments as Open FAIR Digital Objects <a href="https://doi.org/10.3897/rio.8.e93709">https://doi.org/10.3897/rio.8.e93709</a>
</blockquote>
<p>
"Traditional" academic citation is from article to article. For example, consider these two papers:
<blockquote>
Li Y, Li S, Lin Y (2021) Taxonomic study on fourteen symphytognathid species from Asia (Araneae, Symphytognathidae). ZooKeys 1072: 1-47. https://doi.org/10.3897/zookeys.1072.67935
</blockquote>
<blockquote>
Miller J, Griswold C, Yin C (2009) The symphytognathoid spiders of the Gaoligongshan, Yunnan, China (Araneae: Araneoidea): Systematics and diversity of micro-orbweavers. ZooKeys 11: 9-195. https://doi.org/10.3897/zookeys.11.160
</blockquote>
</p>
<p>Li et al. 2021 cites Miller et al. 2009 (although Pensoft seems to have broken the citation such that it does not appear correctly either on their web page or in CrossRef).</p>
<p>So, we have this link: [article]10.3897/zookeys.1072.67935 --cites--> [article]10.3897/zookeys.11.160. One article cites another.</p>
<p>In their 2021 paper Li et al. discuss <i>Patu jidanweishi</i> Miller, Griswold & Yin, 2009:
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPavMHXqQNX1ls_zXo9kIcMLHPxc7ZpV9NCSof5wLumrg3ovPoi6nYKzZsINuqFtYoEvW1QrerePD-MEf2DJaYUXlT37d81x3L6ILls7u229rg0_Nc0uUmgW-ICzr6MI_QCZfgQbYGTxuofu-fuPVoygbCnm3vQVYOhLDLtp1EtQ9jRZHDvw/s1040/Screenshot%202022-09-01%20at%2017.12.27.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="400" data-original-height="314" data-original-width="1040" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPavMHXqQNX1ls_zXo9kIcMLHPxc7ZpV9NCSof5wLumrg3ovPoi6nYKzZsINuqFtYoEvW1QrerePD-MEf2DJaYUXlT37d81x3L6ILls7u229rg0_Nc0uUmgW-ICzr6MI_QCZfgQbYGTxuofu-fuPVoygbCnm3vQVYOhLDLtp1EtQ9jRZHDvw/s400/Screenshot%202022-09-01%20at%2017.12.27.png"/></a></div>
<p>There is a treatment for the original description of <i>Patu jidanweishi</i> at <a href="https://doi.org/10.5281/zenodo.3792232">https://doi.org/10.5281/zenodo.3792232</a>, which was created by Plazi with a time stamp "2020-05-06T04:59:53.278684+00:00". The original publication date was 2009; the treatments are being added retrospectively.</p>
<p>In an ideal world my expectation would be that Li et al. 2021 would have cited the treatment, instead of just providing the text string "Patu jidanweishi Miller, Griswold & Yin, 2009: 64, figs 65A–E, 66A, B, 67A–D, 68A–F, 69A–F, 70A–F and 71A–F (♂♀)." Isn't the expectation under the treatment model that we would have seen this relationship:</p>
<p>[article]10.3897/zookeys.1072.67935 --cites--> [treatment]https://doi.org/10.5281/zenodo.3792232</p>
<p>Furthermore, if it is the case that "[i]n case the treatment does not describe a new discovered taxon, previous treatments are cited in the form of treatment citations" then we should also see a citation between treatments, in other words, Li et al.'s 2021 treatment of <i>Patu jidanweishi</i> (which doesn't seem to have a DOI but is available on Plazi's web site as <a href="https://tb.plazi.org/GgServer/html/1CD9FEC313A35240938EC58ABB858E74">https://tb.plazi.org/GgServer/html/1CD9FEC313A35240938EC58ABB858E74</a>) should also cite the original treatment? It doesn't - but it does cite the Miller et al. paper.</p>
<p>So in this example we don't see articles citing treatments, nor do we see treatments citing treatments. Playing Devil's advocate, why then do we have treatments? Doesn't the lack of citations suggest that - despite some taxonomists saying treatments are the units that matter - they actually don't? If we pay attention to what people do rather than what they say they do, they cite articles.</p>
<p>Now, there are all sorts of reasons why we don't see [article] -> [treatment] citations, or [treatment] -> [treatment] citations. Treatments are being added after the fact by Plazi, not by the authors of the original work. And in many cases the treatments that could be cited haven't appeared until after that potentially citing work was published. In the example above the Miller et al. paper dates from 2009, but the treatment extracted only went online in 2020. And while there is a long standing culture of citing publications (ideally using DOIs) there isn't an equivalent culture of citing treatments (beyond the simple text strings).</p>
<p>Obviously this is but one example. I'd need to do some exploration of the citation graph to get a better sense of citation patterns, perhaps using <a href="https://www.crossref.org/documentation/event-data/">CrossRef's event data</a>. But my sense is that taxonomists don't cite treatments.</p>
<p>I'm guessing Plazi would respond by saying treatments are cited, for example (indirectly) in GBIF downloads. This is true, although arguably people aren't citing the treatment, they're citing specimen data in those treatments, and that specimen data could be extracted at the level of articles rather than treatments. In other words, it's not the treatments themselves that people are citing.</p>
<p>To be clear, I think there is value in being able to identify those "well defined sections" of a publication that deal with a given taxon (i.e., treatments), but it's not clear to me that these are actually the citable units people might hope them to be. Likewise, journals such as <i>ZooKeys</i> have DOIs for individual figures. Does anyone actually cite those?</p>
Roderic Pagehttp://www.blogger.com/profile/00269598293846172649noreply@blogger.comtag:blogger.com,1999:blog-16081779.post-81816752805863019422022-08-24T15:33:00.000+01:002022-08-24T15:38:28.699+01:00Can we use the citation graph to measure the quality of a taxonomic database?<p>More arm-waving notes on taxonomic databases. I've started to add data to <a href="https://www.checklistbank.org">ChecklistBank</a> and this has got me thinking about the issue of data quality. When you add data to ChecklistBank you are asked to give a measure of confidence based on the <a href="https://www.catalogueoflife.org/about/glossary.html#checklist-confidence">Catalogue of Life Checklist Confidence</a> system of one to five stars: ★ - ★★★★★. I'm sceptical about the notion of confidence or "trust" when it is reduced to a star system (see also <a href="https://iphylo.blogspot.com/2012/06/can-you-trust-eol.html">Can you trust EOL?</a>). I could literally pick any number of stars; there's no way to measure what number of stars is appropriate. This feeds into my biggest reservation about the <a href="https://www.catalogueoflife.org">Catalogue of Life</a>: it's almost entirely authority based, not evidence based. That is, rather than give us evidence for why a particular taxon is valid, we are (mostly) just given a list of taxa and asked to accept those as gospel, based on assertions by one or more authorities. I'm not necessarily doubting the knowledge of those making these lists, it's just that I think we need to do better than the "these are the accepted taxa because I say so" implicit in the Catalogue of Life.
</p>
<p>So, is there any way we could objectively measure the quality of a particular taxonomic checklist? Since I have a long-standing interest in linking the primary taxonomic literature to names in databases (that's where the evidence is, after all), I keep wondering whether measures based on that literature could be developed.
</p>
<p>
I recently revisited the fascinating (and quite old) literature on rates of synonymy:
</p>
<blockquote>
Gaston, Kevin J. and Mound, Laurence A. 1993. Taxonomy, hypothesis testing and the biodiversity crisis. Proc. R. Soc. Lond. B 251: 139–142.
<a href="http://doi.org/10.1098/rspb.1993.0020">http://doi.org/10.1098/rspb.1993.0020</a>
</blockquote>
<blockquote>
Andrew R. Solow, Laurence A. Mound, Kevin J. Gaston, Estimating the Rate of Synonymy, Systematic Biology, Volume 44, Issue 1, March 1995, Pages 93–96, <a href="https://doi.org/10.1093/sysbio/44.1.93">https://doi.org/10.1093/sysbio/44.1.93</a>
</blockquote>
</p>
<p>
A key point these papers make is that the observed rate of synonymy is quite high (that is, many "new species" end up being merged with already known species), and that because it can take time to discover that a species is a synonym the actual rate may be even higher. In other words, in diagrams like the one reproduced below, the reason the proportion of synonyms declines the nearer we get to the present day (this paper came out in 1995) is not because we are creating fewer synonyms but because we've not yet had time to do the work to uncover the remaining synonyms.
</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhgwvBcDUD_IJZyAa0gaqXB--ogJeVooV19I635N_w2OuztDFsy-ZPXEFuW5S1NUxxUEHmN8pbdO9MPljIi5v0A2355kYotL1fKqdewP9kmWu9TfwtIJ3jD04SQjeF3SWK-yMVAx-rNc6TO43GIwmftPk6IghOrcmur6SoHe06ws7fEFAvxA/s621/Screenshot%202022-08-24%20at%2014.59.47.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="400" data-original-height="404" data-original-width="621" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhgwvBcDUD_IJZyAa0gaqXB--ogJeVooV19I635N_w2OuztDFsy-ZPXEFuW5S1NUxxUEHmN8pbdO9MPljIi5v0A2355kYotL1fKqdewP9kmWu9TfwtIJ3jD04SQjeF3SWK-yMVAx-rNc6TO43GIwmftPk6IghOrcmur6SoHe06ws7fEFAvxA/s400/Screenshot%202022-08-24%20at%2014.59.47.png"/></a></div>
<p>Put another way, these papers are arguing that the real work of taxonomy is revision, not species discovery, especially since it's not uncommon for > 50% of species in a taxon to end up being synonymised. Indeed, if a taxonomic group has few synonyms then these authors would argue that's a sign of neglect. More revisionary work would likely uncover additional synonyms. So, what we need is a way to measure the amount of research on a taxonomic group. It occurs to me that we could use the citation graph as a way to tackle this. Let's imagine we have a set of taxa (say a family) and we have all the papers that described new species or undertook revisions (or both). The extensiveness of that work could be measured by the citation graph. For example, build the citation graph for those papers. How many original species descriptions are not cited? Those species have been potentially neglected. How many large-scale revisions have there been (as measured by the numbers of taxonomic papers those revisions cite)? There are some interesting approaches to quantifying this, such as using <a href="https://en.wikipedia.org/wiki/HITS_algorithm">hubs and authorities</a>.</p>
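As a sketch of how hubs and authorities might apply, here is a minimal pure-Python HITS implementation run on a toy citation graph (the paper names are invented): revisions that cite many descriptions surface as hubs, and much-revised descriptions as authorities.

```python
# Sketch: hubs and authorities (HITS) on a toy citation graph of
# taxonomic papers. Revisions citing many descriptions become hubs;
# much-cited descriptions become authorities. Paper names are invented.

def hits(edges, iterations=50):
    """edges: (citing, cited) pairs. Returns (hub, authority) score dicts."""
    nodes = {n for e in edges for n in e}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # authority score: sum of hub scores of papers citing you
        auth = {n: sum(hub[s] for s, d in edges if d == n) for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # hub score: sum of authority scores of papers you cite
        hub = {n: sum(auth[d] for s, d in edges if s == n) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

edges = [("revision1", "desc1"), ("revision1", "desc2"),
         ("revision2", "desc2"), ("revision2", "desc3")]
hub, auth = hits(edges)
print(max(auth, key=auth.get))  # → desc2, the most-revised description
```

An uncited description would score zero authority, flagging a potentially neglected species.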
<p>I'm aware that taxonomists have not had the happiest relationship with citations:
<blockquote>
Pinto ÂP, Mejdalani G, Mounce R, Silveira LF, Marinoni L, Rafael JA. Are publications on zoological taxonomy under attack? R Soc Open Sci. 2021 Feb 10;8(2):201617. <a href="https://doi.org/10.1098/rsos.201617">doi: 10.1098/rsos.201617</a>. PMID: 33972859; PMCID: PMC8074659.
</blockquote>
Still, I think there is an intriguing possibility here. For this approach to work, we need to have linked taxonomic names to publications, and have citation data for those publications. This is happening on various platforms. Wikidata, for example, is becoming a repository of the taxonomic literature, some of it with citation links.
<blockquote>
Page RDM. 2022. Wikidata and the bibliography of life. PeerJ 10:e13712 <a href="https://doi.org/10.7717/peerj.13712">https://doi.org/10.7717/peerj.13712</a>
</blockquote>
Time for some experiments.
</p>Roderic Pagehttp://www.blogger.com/profile/00269598293846172649noreply@blogger.comtag:blogger.com,1999:blog-16081779.post-62240165848290070742022-08-22T18:19:00.000+01:002022-08-22T18:19:08.401+01:00Linking taxonomic names to the literatureJust some thoughts as I work through some datasets linking taxonomic names to the literature.
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5ZkbnumEWKZpf0isei_hUJucFlOOK08-SKnJknD8B4qhAX36u-vMQRnZdhRJK5rPb7HNYcnqB7qE4agbeStqyzMkWHrzUj14gPkz2ohmbOVg8P_nHo0hM94PD1wH3SPJsaLAumN8vih3ch9pjH2RaVWZBLwwhGhNu0FS1m5z6j5xt2NeZ4w/s2140/linking%20to%20names144.jpg" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" height="600" data-original-height="2140" data-original-width="1604" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5ZkbnumEWKZpf0isei_hUJucFlOOK08-SKnJknD8B4qhAX36u-vMQRnZdhRJK5rPb7HNYcnqB7qE4agbeStqyzMkWHrzUj14gPkz2ohmbOVg8P_nHo0hM94PD1wH3SPJsaLAumN8vih3ch9pjH2RaVWZBLwwhGhNu0FS1m5z6j5xt2NeZ4w/s600/linking%20to%20names144.jpg"/></a></div>
<p>In the diagram above I've tried to capture the different situations I encounter. Much of the work I've done on this has focussed on case 1 in the diagram: I want to link a taxonomic name to an identifier for the work in which that name was published. In practice this means linking names to DOIs. This has the advantage of linking to a citable identifier, raising questions such as whether citations of taxonomic papers by taxonomic databases could become part of a <a href="https://iphylo.blogspot.com/2022/08/papers-citing-data-that-cite-papers.html">taxonomist's Google Scholar profile</a>.</p>
<p>In many taxonomic databases full work-level citations are not the norm, instead taxonomists cite one or more pages within a work that are relevant to a taxonomic name. These "microcitations" (what the U.S. legal profession refer to as "point citations" or "pincites", see <a href="https://rasmussen.libanswers.com/faq/283203">What are pincites, pinpoints, or jump legal references?</a>) require some work to map to the work itself (which is typically the thing that has a citable identifier such as a DOI).</p>
<p>Microcitations (case 2 in the diagram above) can be quite complex. Some might simply mention a single page, but others might list a series of (not necessarily contiguous) pages, as well as figures, plates etc. Converting these to citable identifiers can be tricky, especially as in most cases we don't have page-level identifiers. The Biodiversity Heritage Library (BHL) does have URLs for each scanned page, and we have a standard for referring to pages in a PDF (<code>page=<pageNum></code>, see <a href="https://datatracker.ietf.org/doc/html/rfc8118">RFC 8118</a>). But how do we refer to a set of pages? Do we pick the first page? Do we try and represent a set of pages, and if so, how?</p>
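One possible convention, sketched below, is to expand a microcitation's page list into a set of RFC 8118-style "#page=" fragment links, one per cited page. The PDF URL and the <code>page_links</code> helper are hypothetical; this sidesteps rather than solves the "which page do we pick?" question by simply keeping every page.

```python
# Sketch: cite a set of pages by emitting one RFC 8118 "#page=N"
# fragment link per page. The URL is a hypothetical stand-in for
# wherever the work's PDF actually lives.

def page_links(pdf_url, pages):
    """Turn a microcitation's page numbers into page-anchored links."""
    return [f"{pdf_url}#page={p}" for p in sorted(set(pages))]

# e.g. a pincite of "pages 93-96" becomes four page-level links
print(page_links("https://example.org/article.pdf", [93, 94, 95, 96]))
```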
<p>Another issue with page-level identifiers is that not everything on a given page may be relevant to the taxonomic name. In case 2 above I've shaded in the parts of the pages and figure that refer to the taxonomic name. An example where this can be problematic is the recent test case I created for BHL where a page image was included for the taxonomic name <a href="https://www.gbif.org/species/195763322"><i>Aphrophora impressa</i></a>. The image includes the species description and an illustration, as well as text that relates to other species.</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1SAv1XiVBPHXeMLPfdsnTh4Sj4m0AQyqjXM0faXyvbNhBXFDl7SawjFIkMo3qz4RQhDEyZXkKAh5nF4gtfHbVXA6cK3NJ46CuIWECC6HmJKtTjoZ0M3QQXvLY_X-2-RecjBqEy68M0cEr0ph3l6KY51kA9BvGt9d4id314P71PBitpWMATg/s3467/https---www.biodiversitylibrary.org-pageimage-29138463.jpeg" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" height="400" data-original-height="3467" data-original-width="2106" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1SAv1XiVBPHXeMLPfdsnTh4Sj4m0AQyqjXM0faXyvbNhBXFDl7SawjFIkMo3qz4RQhDEyZXkKAh5nF4gtfHbVXA6cK3NJ46CuIWECC6HmJKtTjoZ0M3QQXvLY_X-2-RecjBqEy68M0cEr0ph3l6KY51kA9BvGt9d4id314P71PBitpWMATg/s400/https---www.biodiversitylibrary.org-pageimage-29138463.jpeg"/></a></div>
<p>Given that not everything on a page need be relevant, we could extract just the relevant blocks of text and illustrations (e.g., paragraphs of text, panels within a figure, etc.) and treat that set of elements as the thing to cite. This is, of course, what <a href="http://plazi.org">Plazi</a> are doing. The set of extracted blocks is glued together as a "treatment", assigned an identifier (often a DOI), and treated as a citable unit. It would be interesting to see to what extent these treatments are actually cited: do subsequent revisions that cite works containing treatments cite those treatments, or just the works themselves? Put another way, are we creating <a href="https://iphylo.blogspot.com/2012/09/decoding-nature-encode-ipad-app-omg-it.html">"threads"</a> between taxonomic revisions?</p>
<p>One reason for these notes is that I'm exploring uploading taxonomic name - literature links to <a href="https://www.checklistbank.org">ChecklistBank</a> and case 1 above is easy, as is case 3 (if we have treatment-level identifiers). But case 2 is problematic because we are linking to a set of things that may not have an identifier, which means a decision has to be made about which page to link to, and how to refer to that page.</p>
Roderic Pagehttp://www.blogger.com/profile/00269598293846172649noreply@blogger.comtag:blogger.com,1999:blog-16081779.post-16006720376775875532022-08-03T12:33:00.002+01:002022-08-03T12:33:13.823+01:00Papers citing data that cite papers: CrossRef, DataCite, and the Catalogue of LifeQuick notes to self following on from a conversation about linking taxonomic names to the literature.
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Is there a way to turn those links into countable citations (even if just one per database) for Google Scholar?</p>— Wayne Maddison (@WayneMaddison) <a href="https://twitter.com/WayneMaddison/status/1554644747406348288?ref_src=twsrc%5Etfw">August 3, 2022</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
There are different sorts of citation:
<ol>
<li>Paper cites another paper</li>
<li>Paper cites a dataset</li>
<li>Dataset cites a paper</li>
</ol>
Citation type (1) is largely a solved problem (although there are issues around the ownership and use of this data, see e.g. <a href="https://iphylo.blogspot.com/2020/07/zootaxa-has-no-impact-factor.html">Zootaxa has no impact factor</a>).
Citation type (2) is becoming more widespread (but is not perfect, as GBIF's <a href="https://twitter.com/search?q=%23citethedoi&src=typed_query">#citethedoi</a> campaign demonstrates). But the idea is well accepted and there are guides to how to do it, e.g.:
<blockquote>
Cousijn, H., Kenall, A., Ganley, E. et al. A data citation roadmap for scientific publishers. Sci Data 5, 180259 (2018). <a href="https://doi.org/10.1038/sdata.2018.259">https://doi.org/10.1038/sdata.2018.259</a>
</blockquote>
However, things do get problematic because most (but not all) DOIs for publications are managed by CrossRef, which has an extensive citation database linking papers to other papers. Most datasets have DataCite DOIs, and DataCite manages its own citation links, but as far as I'm aware these two systems don't really talk to each other.
Citation type (3) is the case where a database is largely based on the literature, which applies to taxonomy. Taxonomic databases are essentially collections of literature that have opinions on taxa, and the database may simply compile those opinions (e.g., a nomenclator), or come to some view on the applicability of each name. In an ideal world, each reference included in a taxonomic database would gain a citation, which would help better reflect the value of that work (a long-standing bone of contention for taxonomists).
It would be interesting to explore these issues further. CrossRef and DataCite do share <a href="https://www.crossref.org/services/event-data/">Event Data</a> (see also <a href="https://support.datacite.org/docs/eventdata-guide">DataCite Event Data</a>). Can this track citations of papers by a dataset?
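<p>As a concrete (and hedged) sketch, here is how one might ask Event Data for events that point at a given DOI. The endpoint and the <code>obj-id</code>/<code>rows</code> parameters follow the public Event Data API documentation, but check the current docs before relying on them; the example DOI is the data citation roadmap paper cited above.</p>

```python
from urllib.parse import urlencode

# Build a CrossRef Event Data query for events that point *at* a DOI
# (the DOI as "obj-id"). Endpoint and parameter names follow the
# public Event Data API; verify against the current documentation.
EVENT_DATA_API = "https://api.eventdata.crossref.org/v1/events"

def event_data_query(doi, rows=100):
    """Return a query URL for events whose object is the given DOI."""
    params = {"obj-id": "https://doi.org/" + doi, "rows": rows}
    return EVENT_DATA_API + "?" + urlencode(params)

# e.g. events citing the data citation roadmap paper:
print(event_data_query("10.1038/sdata.2018.259"))
```

<p>Fetching that URL returns JSON describing each event (source, subject, object), which would be the raw material for counting dataset-to-paper citations.</p>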
My take on Wayne's question:
<blockquote>
Is there a way to turn those links into countable citations (even if just one per database) for Google Scholar?
</blockquote>
is that he's after type 3 citations, which I don't think we have a way to handle just yet (but I'd need to look at Event Data a bit more). Google Scholar is a black box, and the academic community's reliance on it for metrics is troubling. But it would be interesting to try and figure out whether there is a way to get Google Scholar to index the citations of taxonomic papers by databases. For instance, the <a href="https://www.catalogueoflife.org/">Catalogue of Life</a> has an ISSN <a href="https://portal.issn.org/resource/ISSN/2405-884X">2405-884X</a>, so it can be treated as a publication. At the moment its web pages have lots of identifiers for the people managing data and their organisations (lots of <a href="https://orcid.org">ORCIDs</a> and <a href="https://ror.org">RORs</a>), and DOIs for individual datasets (e.g., <a href="https://www.checklistbank.org/dataset/9828/about">checklistbank.org</a>), but precious little in the way of DOIs for publications (or, indeed, ORCIDs for taxonomists). What would it take for taxonomic publications in the Catalogue of Life to be treated as first-class citations?
Roderic Pagehttp://www.blogger.com/profile/00269598293846172649noreply@blogger.comGlasgow, UK55.864237 -4.25180635.748952987896374 -39.408056 75.979521012103632 30.904443999999998tag:blogger.com,1999:blog-16081779.post-44720705306513207782022-05-27T17:34:00.001+01:002022-05-27T17:34:19.487+01:00Round trip from identifiers to citations and back again<p>Note to self (basically rewriting last year's <a href="https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html">Finding citations of specimens</a>).</p>
<p>Bibliographic data supports going from identifier to citation string and back again, so we can do a "round trip."</p>
<h2>1.</h2>
<p>Given a DOI we can get structured data with a simple HTTP fetch, then use a tool such as <a href="https://citation.js.org">citation.js</a> to convert that data into a human-readable string in a variety of formats.</p>
<table>
<tr>
<th>
Identifier
</th>
<th>
⟶
</th>
<th>
Structured data
</th>
<th>
⟶
</th>
<th>
Human readable string
</th>
</tr>
<tr>
<td>
10.7717/peerj-cs.214
</td>
<td>
HTTP with content-negotiation
</td>
<td>
CSL-JSON
</td>
<td>
CSL templates
</td>
<td width="25%">
Willighagen, L. G. (2019). Citation.js: a format-independent, modular bibliography tool for the browser and command line. PeerJ Computer Science, 5, e214. https://doi.org/10.7717/peerj-cs.214
</td>
</tr>
</table>
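<p>Step 1 in code: a minimal sketch of content negotiation against <code>doi.org</code>, asking for CSL-JSON (the <code>Accept</code> type that CrossRef and DataCite both support) instead of the landing page.</p>

```python
import json
import urllib.request

# Content negotiation: ask the DOI resolver for CSL-JSON rather than
# the human-readable landing page. A tool such as citation.js can then
# turn the structured data into a formatted reference string.
CSL_JSON = "application/vnd.citationstyles.csl+json"

def csl_request(doi):
    """Build the content-negotiated request for a DOI."""
    return urllib.request.Request(
        "https://doi.org/" + doi, headers={"Accept": CSL_JSON}
    )

def fetch_csl(doi):
    """Fetch CSL-JSON for a DOI (needs network access)."""
    with urllib.request.urlopen(csl_request(doi)) as response:
        return json.load(response)
```

<p>Calling <code>fetch_csl("10.7717/peerj-cs.214")</code> should return the structured record behind the human-readable citation shown in the table above.</p>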
<h2>2.</h2>
<p>Going in the reverse direction (string to identifier) is a little more challenging. In the "old days" a typical strategy was to parse the citation string into structured data (see <a href="https://anystyle.io">AnyStyle</a> for a nice example of this), then extract a tuple of (journal, volume, starting page) and use that to query CrossRef to see whether there was an article with that tuple, which gave us the DOI.</p>
<table>
<tr>
<th>
Identifier
</th>
<th>
⟵
</th>
<th>
Structured data
</th>
<th>
⟵
</th>
<th>
Human readable string
</th>
</tr>
<tr>
<td>
10.7717/peerj-cs.214
</td>
<td>
OpenURL query
</td>
<td>
journal, volume, start page
</td>
<td>
Citation parser
</td>
<td width="25%">
Willighagen, L. G. (2019). Citation.js: a format-independent, modular bibliography tool for the browser and command line. PeerJ Computer Science, 5, e214. https://doi.org/10.7717/peerj-cs.214
</td>
</tr>
</table>
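<p>A sketch of the lookup step, assuming CrossRef's legacy OpenURL endpoint and its <code>pid</code>/<code>redirect</code> parameters; these follow older documentation and may have changed, so the REST API at <code>api.crossref.org</code> is the safer modern route.</p>

```python
from urllib.parse import urlencode

# Build a lookup from a parsed (journal, volume, starting page) tuple.
# The endpoint and the pid/redirect parameters are based on CrossRef's
# older OpenURL documentation, so treat them as assumptions.
OPENURL = "https://doi.crossref.org/openurl"

def openurl_query(journal, volume, spage, email="you@example.org"):
    """Return an OpenURL lookup URL for a citation tuple."""
    params = {
        "title": journal,
        "volume": volume,
        "spage": spage,
        "pid": email,        # CrossRef asks callers to identify themselves
        "redirect": "false", # return metadata rather than redirecting
    }
    return OPENURL + "?" + urlencode(params)

print(openurl_query("PeerJ Computer Science", "5", "e214"))
```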
<h2>3.</h2>
<p>Another strategy is to take all the citation strings for each DOI, index those in a search engine, then just use a simple search to find the best match to your citation string, and hence the DOI. This is what <a href="https://search.crossref.org">https://search.crossref.org</a> does.</p>
<table>
<tr>
<th>
Identifier
</th>
<th>
⟵
</th>
<th>
Human readable string
</th>
</tr>
<tr>
<td>
10.7717/peerj-cs.214
</td>
<td>
search
</td>
<td width="50%">
Willighagen, L. G. (2019). Citation.js: a format-independent, modular bibliography tool for the browser and command line. PeerJ Computer Science, 5, e214. https://doi.org/10.7717/peerj-cs.214
</td>
</tr>
</table>
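<p>Since search.crossref.org doesn't expose a documented matching API, a sketch of the same idea can use the CrossRef REST API's <code>query.bibliographic</code> parameter, which is intended for exactly this kind of citation-string matching.</p>

```python
from urllib.parse import urlencode

# Match a free-form citation string using the CrossRef REST API's
# query.bibliographic parameter; the top-ranked result is usually the
# sought-after DOI, but its score should be checked before trusting it.
WORKS_API = "https://api.crossref.org/works"

def match_citation(citation, rows=1):
    """Return a query URL that searches CrossRef for a citation string."""
    return WORKS_API + "?" + urlencode(
        {"query.bibliographic": citation, "rows": rows}
    )

print(match_citation(
    "Willighagen, L. G. (2019). Citation.js. PeerJ Computer Science, 5, e214."
))
```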
<p>At the moment my work on material citations (i.e., lists of specimens in taxonomic papers) is focussing on 1 (generating citations from specimen data in GBIF) and 2 (parsing citations into structured data).</p>Roderic Pagehttp://www.blogger.com/profile/00269598293846172649noreply@blogger.comtag:blogger.com,1999:blog-16081779.post-5316813576276551352022-05-11T17:53:00.001+01:002022-05-12T11:10:05.917+01:00Thoughts on TreeBASE dying(?)<blockquote class="twitter-tweet"><p lang="en" dir="ltr"><a href="https://twitter.com/rvosa?ref_src=twsrc%5Etfw">@rvosa</a> is Naturalis no longer hosting Treebase? <a href="https://t.co/MBRgcxaBmR">https://t.co/MBRgcxaBmR</a></p>— Hilmar Lapp (@hlapp) <a href="https://twitter.com/hlapp/status/1524166490798309381?ref_src=twsrc%5Etfw">May 10, 2022</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>So it looks like <a href="http://treebase.org">TreeBASE</a> is in trouble, its legacy Java code a victim of security issues. Perhaps this is a chance to rethink TreeBASE, assuming that a repository of published phylogenies is still considered a worthwhile thing to have (and I think that question is open).</p>
<p>Here's what I think could be done.</p>
<ol>
<li>
The data (individual studies with trees and data) are packaged into whatever format is easiest (NEXUS, XML, JSON) and uploaded to a repository such as <a href="https://zenodo.org">Zenodo</a> for long term storage. They get DOIs for citability. This becomes the default storage for TreeBASE.
</li>
<li>
The data is transformed into JSON and indexed using Elasticsearch. A simple web interface is placed on top so that people can easily find trees (never a strong point of the original TreeBASE). Trees are displayed natively on the web using SVG. The number one goal is for people to be able to find trees, view them, and download them.
</li>
<li>
To add data to TreeBASE the easiest way would be for people to upload them direct to Zenodo and tag them "treebase". A bot then grabs a feed of these datasets and adds them to the search engine in (1) above. As time allows, add an interface where people upload data directly, it gets curated, then deposited in Zenodo. This presupposes that there are people available to do curation. Maybe have "stars" for the level of curation so that users know whether anyone has checked the data.
</li>
</ol>
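<p>Step 3 could be as simple as a bot polling Zenodo's REST API for depositions tagged "treebase". The records endpoint below is part of Zenodo's public API, but treat the details (keyword field name, paging parameters) as assumptions to verify.</p>

```python
from urllib.parse import urlencode

# Poll Zenodo for depositions carrying the "treebase" keyword. The
# records endpoint follows Zenodo's public REST API; the keyword
# field name and paging parameters are assumptions.
ZENODO_RECORDS = "https://zenodo.org/api/records"

def treebase_feed_url(page=1, size=20):
    """Return a query URL for Zenodo records tagged 'treebase'."""
    params = {"q": 'keywords:"treebase"', "page": page, "size": size}
    return ZENODO_RECORDS + "?" + urlencode(params)

print(treebase_feed_url())
```

<p>Each record returned includes its DOI and file links, which is all the bot needs to add a study to the search engine.</p>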
<p>There are lots of details to tweak, for example how many of the existing URLs for studies are preserved (some URL mapping), and what about the API? And I'm unclear about the relationship with <a href="https://datadryad.org">Dryad</a>.</p>
<p>My sense is that the TreeBASE code is very much of its time (10-15 years ago), a monolithic block of code with SQL, Java, etc. If one was starting from scratch today I don't think this would be the obvious solution. Things have trended towards being simpler, with lots of building blocks now available in the cloud. Need a search engine? Just spin up a container in the cloud and you have one. More and more functionality can be devolved elsewhere.</p>
<p>Another issue is how to support TreeBASE. It has essentially been a volunteer effort to date, with little or no funding. One reason I favour having Zenodo as the storage engine is that it takes care of the long-term sustainability of the data.</p>
<p>I realise that this is all wild arm waving, but maybe now is the time to reinvent TreeBASE?</p>
<h2>Updates</h2>
<p>It's been a while since I've paid a lot of attention to phylogenetic databases, and it shows. There is a file-based storage system for phylogenies, <a href="https://github.com/OpenTreeOfLife/phylesystem-1">phylesystem</a> (see "Phylesystem: a git-based data store for community-curated phylogenetic estimates" <a href="https://doi.org/10.1093/bioinformatics/btv276">https://doi.org/10.1093/bioinformatics/btv276</a>), that is sort of what I had in mind, although long-term persistence is based on GitHub rather than a repository such as Zenodo. Phylesystem uses a truly horrible-looking JSON transformation of <a href="http://nexml.github.io">NeXML</a> (NeXML itself is ugly), and TreeBASE also supports NeXML, so some form of NeXML or a JSON transformation of it seems the obvious storage format. It will probably need some cleaning and simplification if it is to be indexed easily. Looking back over the long history of TreeBASE and phylogenetic databases, I'm struck by how much complexity has been introduced over time. I think the tech has sometimes gotten in the way (which might just be another way of saying that I'm not smart enough to make sense of it all).</p>
<p>So we could imagine a search engine that covers both TreeBASE and <a href="https://tree.opentreeoflife.org/curator">Open Tree of Life studies</a>.</p>
<p>Basic metadata-based searches would be straightforward, and we could have a user interface that highlights the trees (I think TreeBASE's biggest search rival is a Google image search). The harder problem is searching by tree structure, for which there is an interesting literature without any decent implementations that I'm aware of (as I said, I've been out of this field a while).</p>
<p>So my instinct is that we could go a long way by simply indexing JSON (CouchDB or Elasticsearch), and then think a bit more cleverly about higher-taxon and tree-based searching. I've always thought that one killer query would be not so much "show me all the trees for my taxon" but "show me a synthesis of the trees for my taxon". Imagine a supertree of recent studies that we could use as a summary of our current knowledge, or a visualisation that summarises where there are conflicts among the trees.</p>
<h3>Relevant code and sites</h3>
<ul>
<li><a href="https://github.com/rdmpage/cdaotools">CDAO Tools</a>, see "CDAO-Store: Ontology-driven Data Integration for Phylogenetic Analysis" <a href="https://doi.org/10.1186/1471-2105-12-98">https://doi.org/10.1186/1471-2105-12-98</a></li>
<li><a href="https://github.com/NESCent/phylocommons">PhyloCommons</a></li>
</ul>Roderic Pagehttp://www.blogger.com/profile/00269598293846172649noreply@blogger.comtag:blogger.com,1999:blog-16081779.post-62305139514577126532022-04-07T22:07:00.001+01:002022-04-07T22:15:34.868+01:00Obsidian, markdown, and taxonomic trees<p>Returning to the subject of <a href="https://iphylo.blogspot.com/2020/08/personal-knowledge-graphs-obsidian-roam.html">personal knowledge graphs</a> Kyle Scheer has an interesting repository of Markdown files that describe academic disciplines at <a href="https://github.com/kyletscheer/academic-disciplines">https://github.com/kyletscheer/academic-disciplines</a> (see <a href="https://kyletscheer.medium.com/on-creating-a-tree-of-knowledge-f099c1028bf6">his blog post</a> for more background).</p>
<p>If you add these files to <a href="https://obsidian.md/">Obsidian</a> you get a nice visualisation of a taxonomy of academic disciplines. The applications of this to biological taxonomy seem obvious, especially as a tool like Obsidian enables all sorts of interesting links to be added (e.g., we could add links to the taxonomic research behind each node in the taxonomic tree, the people doing that research, etc. - although that would mean we'd no longer have a simple tree).</p>
<p>The more I look at these sort of simple Markdown-based tools the more I wonder whether we could make more use of them to create simple but persistent databases. Text files seem the most stable, long-lived digital format around, maybe this would be a way to minimise the inevitable obsolescence of database and server software. Time for some experiments I feel... can we take a taxonomic group, such as mammals, and create a richly connected database purely in Markdown?</p>
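<p>As a quick sketch of such an experiment, a parent-child taxonomy can be turned into one Markdown note per taxon, linked with the <code>[[wiki links]]</code> Obsidian uses to build its graph view (the taxa here are purely illustrative):</p>

```python
# Turn a tiny parent-child taxonomy into Obsidian-style Markdown notes
# linked with [[wiki links]]. The taxa are purely illustrative.
taxonomy = {
    "Mammalia": None,
    "Carnivora": "Mammalia",
    "Primates": "Mammalia",
    "Felidae": "Carnivora",
}

def note_for(taxon):
    """Render one Markdown note linking a taxon to its parent and children."""
    lines = ["# " + taxon, ""]
    parent = taxonomy[taxon]
    if parent:
        lines.append("Parent: [[" + parent + "]]")
    children = sorted(t for t, p in taxonomy.items() if p == taxon)
    if children:
        lines.append("Children: " + ", ".join("[[" + c + "]]" for c in children))
    return "\n".join(lines) + "\n"

for taxon in taxonomy:
    print(note_for(taxon))
```

<p>Writing each note to a <code>&lt;taxon&gt;.md</code> file in a vault gives the same kind of graph view as the academic-disciplines example, and extra links (literature, people) are just more lines of Markdown.</p>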
<div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='400' height='322' src='https://www.blogger.com/video.g?token=AD6v5dyo5SkPa5QlBxFGN31zgm8prUUvjfccuDxyk2EvKFboYpQ7dYZzN3u9gJ9sG0K2bp7tJE94-Bz4WeA' class='b-hbp-video b-uploaded' frameborder='0'></iframe></div>
Roderic Pagehttp://www.blogger.com/profile/00269598293846172649noreply@blogger.com