Showing posts with label mammals. Show all posts

Sunday, January 02, 2022

Large graph viewer experiments

I keep returning to the problem of viewing large graphs and trees, which means my hard drive has accumulated lots of failed prototypes. Inspired by some recent discussions on comparing taxonomic classifications I decided to package one of these (wildly incomplete) prototypes up so that I can document the idea and put the code somewhere safe.

Google Maps-like viewer

I've created a simple viewer that uses a tiled map viewer (like Google Maps) to display a large graph. The idea is to draw the entire graph scaled to a 256 x 256 pixel tile. The graph is stored in a database that supports geospatial queries, which means that the queries to retrieve the individual tiles needed to display the graph at different levels of resolution are simply bounding box queries to the database. I realise that this description is cryptic at best. The GitHub repository https://github.com/rdmpage/gml-viewer has more details and the code itself. There's a lot to do, especially adding support for labels(!), which presents some interesting challenges (levels of detail and generalization). The code doesn't do any layout of the graph itself; instead I've used the yEd tool to compute the x,y coordinates of the graph.
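
To make the tile idea a little less cryptic, here is a minimal sketch of the arithmetic involved. It is not the code from the repository: it assumes the usual Google Maps-style scheme in which zoom level z has 2^z tiles per side, and the names TILE_SIZE, tile_bbox and to_tile_pixels are made up for illustration.

```python
from typing import NamedTuple, Tuple

TILE_SIZE = 256  # pixels; assumption: the whole graph fits one tile at zoom 0


class BBox(NamedTuple):
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def tile_bbox(zoom: int, tile_x: int, tile_y: int) -> BBox:
    """Bounding box, in zoom-0 graph coordinates, covered by one tile."""
    tiles_per_side = 2 ** zoom
    if not (0 <= tile_x < tiles_per_side and 0 <= tile_y < tiles_per_side):
        raise ValueError("tile index out of range for this zoom level")
    span = TILE_SIZE / tiles_per_side  # width of one tile in graph units
    return BBox(tile_x * span, tile_y * span,
                (tile_x + 1) * span, (tile_y + 1) * span)


def to_tile_pixels(x: float, y: float, zoom: int,
                   tile_x: int, tile_y: int) -> Tuple[float, float]:
    """Map a graph coordinate to pixel coordinates inside the given tile."""
    scale = 2 ** zoom
    return (x * scale - tile_x * TILE_SIZE, y * scale - tile_y * TILE_SIZE)
```

The box returned by tile_bbox is what would be handed to the database as a bounding box query; each node or edge that comes back can then be placed on the tile with to_tile_pixels.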

Since this exercise was inspired by a discussion of the ASM Mammal Diversity Database, the graph I've used for the demonstration above is the ASM classification of extant mammals. I guess I need to solve the labelling issue fairly quickly!

Tuesday, September 01, 2009

Wikipedia mammals and the power law

Playing a bit more with the Wikipedia mammal data, there are some interesting patterns to note. The first is that if we rank the mammal pages by size (here defined as the number of characters in the source for the page) and plot size against rank, we get a graph that looks very much like a power law:
pow1.png

There are a few large pages on mammals (these are on the left), and lots of small pages (the long tail on the right). If we do a log-log plot we get this:
pow2.png

The straight line is characteristic of a power law. The dip at the far right reflects the fact that Wikipedia pages have a minimum size (for example, they must include a Taxobox). Now, this is a bit crude (I should probably look at "Power-law distributions in empirical data" arXiv:0706.1062v2 before getting too carried away), but power laws are characteristic of the link structure of the web (a few big sites with huge numbers of links, huge numbers of sites with few links), and indeed of at least parts of Wikipedia, such as the Gene Wiki project (see doi:10.1371/journal.pbio.0060175).
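
For anyone wanting to reproduce the log-log plot numerically, a rough sketch follows. It fits a straight line to log(size) against log(rank) by ordinary least squares, which is exactly the crude approach mentioned above; the arXiv paper cited recommends maximum-likelihood fitting instead. The function name rank_size_slope is illustrative, and the input is assumed to be a plain list of page sizes.

```python
import math
from typing import Sequence


def rank_size_slope(sizes: Sequence[float]) -> float:
    """Least-squares slope of log(size) against log(rank), rank 1 = largest."""
    ordered = sorted((s for s in sizes if s > 0), reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two positive sizes")
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(size) for size in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Standard simple linear regression: slope = cov(x, y) / var(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx
```

A roughly constant negative slope across the ranks is what the straight line in the plot corresponds to; the dip at the far right would pull the estimate down, so it may be worth excluding the smallest pages before fitting.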

In this context, the diagrams are showing that even if mammals are "charismatic megafauna", most of them aren't that charismatic. Wikipedia mammal pages are mostly small. This raises the question of whether the high frequency with which Wikipedia mammal pages appeared at the top of Google searches might be attributed to those large pages on (presumably) charismatic mammals. If this were the case, then we'd expect that small pages wouldn't rank highly in Google searches. So, I plotted page size against Google search rank for the Wikipedia mammal pages:
sizexrank.png

This is a box plot, where the grey boxes represent 50% of the distribution of page size (the horizontal black line is the median), and extreme values are shown as circles. Note that "0" is the highest rank (i.e., the first hit in Google), and 9 is the lowest.
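
The numbers behind such a box plot can be sketched as below. This is only an illustration: it assumes the data are available as (search rank, page size) pairs, and the function name size_quartiles_by_rank is hypothetical. The quartile method used here ("inclusive") is one of several conventions and may differ slightly from whatever plotting tool produced the figure.

```python
from collections import defaultdict
from statistics import quantiles
from typing import Dict, Iterable, List, Tuple


def size_quartiles_by_rank(
    pages: Iterable[Tuple[int, int]]
) -> Dict[int, Tuple[float, float, float]]:
    """Group (search_rank, page_size) pairs by rank; return (Q1, median, Q3)."""
    by_rank: Dict[int, List[int]] = defaultdict(list)
    for rank, size in pages:
        by_rank[rank].append(size)
    summary = {}
    for rank, sizes in sorted(by_rank.items()):
        if len(sizes) < 2:
            continue  # quantiles() needs at least two data points
        q1, median, q3 = quantiles(sizes, n=4, method="inclusive")
        summary[rank] = (q1, median, q3)
    return summary
```

The grey box for each rank spans Q1 to Q3, and the black line sits at the median.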

While, not surprisingly, most large Wikipedia pages do well in Google searches, and large pages are rarely low down the rankings, my sense is that small pages can have any rank, from top (0) to bottom (9). If page size (which is a crude measure of the effort put into editing a Wikipedia page) is a measure of "charisma" (contributors are more likely to edit pages on animals that lots of people know about), then charisma isn't a great predictor of where you come in Google's search results. It's not about size, it's about being in Wikipedia.

Monday, August 31, 2009

Comparing Wikipedia and Mammal Species of the World classifications



Continuing the saga of making sense of the mammal classification in Wikipedia, I've done a quick comparison with the Mammal Species of the World (third edition) classification. MSW is the default taxonomic reference used by WikiProject Mammals. I downloaded the MSW taxonomy as a CSV file (warning, it's big), and wrote a script to pull out the classification as a GML file (my preferred graph format).
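
The CSV-to-GML step might look something like the sketch below. It is not the script used for this post: the rank column names are passed in by the caller because the actual column headings in the MSW file are not given here, and the function names rows_to_gml and csv_to_gml are invented for illustration. Nodes are keyed by their full path from the root so that the same name under two different parents is not merged.

```python
import csv
from typing import Dict, Iterable, List, Optional, Sequence, Tuple


def rows_to_gml(rows: Iterable[Dict[str, str]],
                rank_columns: Sequence[str]) -> str:
    """Build a GML tree from rows that hold one taxon name per rank column."""
    node_ids: Dict[Tuple[str, ...], int] = {}
    edges: List[Tuple[int, int]] = []
    for row in rows:
        path: Tuple[str, ...] = ()
        parent_id: Optional[int] = None
        for column in rank_columns:  # columns ordered from highest rank down
            name = (row.get(column) or "").strip()
            if not name:
                continue  # this rank is not used for this taxon
            path = path + (name,)
            if path not in node_ids:
                node_ids[path] = len(node_ids)
                if parent_id is not None:
                    edges.append((parent_id, node_ids[path]))
            parent_id = node_ids[path]
    lines = ["graph [", "  directed 1"]
    for path, node_id in node_ids.items():
        label = path[-1].replace('"', "'")  # keep the GML string well formed
        lines.append(f'  node [ id {node_id} label "{label}" ]')
    for source, target in edges:
        lines.append(f"  edge [ source {source} target {target} ]")
    lines.append("]")
    return "\n".join(lines)


def csv_to_gml(csv_path: str, rank_columns: Sequence[str]) -> str:
    """Read a classification CSV (with a header row) and return GML text."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return rows_to_gml(csv.DictReader(handle), rank_columns)
```

The output can be opened directly in a tool such as yEd, which reads GML.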

Based on some earlier work with Gabriel Valiente, I wrote a simple program that takes two trees and highlights the nodes in common to the two trees. I then ran this program on the MSW tree and the largest component of the graph of Wikipedia mammals. The MSW tree has 13582 nodes; the Wikipedia tree has 6287. Note that Wikipedia has more taxa than these 6287 nodes suggest, but they aren't connected to the largest tree (often due to intermediate nodes in the classification lacking a page in Wikipedia). The two trees have 4935 nodes in common (again, this number will be a little low, as there are some weird taxon names due to problems parsing Wikipedia).
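
The simplest possible version of "nodes in common" is a comparison of node labels, sketched below. This is an assumption about how such a comparison could be done, not a description of the program mentioned above, which may well do more than match names. The helper names normalise, common_taxa and fraction_shared are illustrative.

```python
from typing import Iterable, Set


def normalise(name: str) -> str:
    """Collapse whitespace and ignore case so trivial differences still match."""
    return " ".join(name.split()).casefold()


def common_taxa(names_a: Iterable[str], names_b: Iterable[str]) -> Set[str]:
    """Normalised names that occur in both classifications."""
    return {normalise(n) for n in names_a} & {normalise(n) for n in names_b}


def fraction_shared(shared: int, total: int) -> float:
    """Share of one tree's nodes that also occur in the other tree."""
    return shared / total
```

With the counts quoted above, fraction_shared(4935, 6287) is about 0.785 and fraction_shared(4935, 13582) is about 0.363: roughly 78% of the Wikipedia tree is found in MSW, but only about 36% of the MSW tree is found in Wikipedia.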

MSW versus Wikipedia
Below is the Wikipedia classification, with taxa that are also in MSW shown in red.
w-msw.jpg


[Larger scale view here]

The impression given is that most Wikipedia mammal pages are in MSW, with some notable exceptions, including higher level taxa such as Afrotheria, and extinct taxa such as the Multituberculata. Some extant taxa are missing due to synonymy. For example, Wikipedia gives the scientific name of Anthony's pipistrelle as Pipistrellus anthonyi, whereas MSW has it as Hypsugo anthonyi.
As an aside, Wikipedia pages often get muddled about parentheses around taxonomic author names. The authority is in parentheses if the current genus is not the genus in which the species was originally placed. Hence, Pipistrellus anthonyi (Tate, 1942) should actually be Pipistrellus anthonyi Tate, 1942, as Tate originally described this taxon as a species of Pipistrellus (see hdl:2246/1783). However, the name Hypsugo anthonyi (Tate, 1942) does need parentheses.
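
The parentheses rule is mechanical enough to write down as a small function. The sketch below is illustrative only (the name format_name and its parameters are invented), and it covers just the simple case of a species with a single author and year.

```python
def format_name(current_genus: str, epithet: str, author: str, year: int,
                original_genus: str) -> str:
    """Species name with authority, parenthesised only if the genus has changed."""
    authority = f"{author}, {year}"
    if current_genus != original_genus:
        # The species has been moved out of the genus it was described in.
        authority = f"({authority})"
    return f"{current_genus} {epithet} {authority}"


print(format_name("Pipistrellus", "anthonyi", "Tate", 1942, "Pipistrellus"))
# Pipistrellus anthonyi Tate, 1942
print(format_name("Hypsugo", "anthonyi", "Tate", 1942, "Pipistrellus"))
# Hypsugo anthonyi (Tate, 1942)
```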


Some Wikipedia taxa also postdate the publication of MSW, such as Philander deltae (see doi:10.1644/05-MAMM-A-065R2.1).


Wikipedia versus MSW
When we do the reverse comparison we see something rather different.

msw-w.jpg


[Larger scale view here]

This is the MSW tree, coloured red where the MSW taxon has a page in Wikipedia. There are big gaps, some of which are due to those pages being in another component (in other words, many "missing" taxa do have pages in Wikipedia, they are just not properly linked to the bigger tree). MSW is also rich in subspecies, which tend to lack their own pages in Wikipedia (possibly a good thing in the cases of taxa such as pocket gophers).

It would be nice to make these comparisons automatic, and develop tools so that managing taxonomy in Wikipedia could be made easier.

Saturday, August 29, 2009

Mammal tree from Wikipedia

Following on from my previous post about visualising the mammalian classification in Wikipedia, I've extracted the largest component from the graph for all mammal taxa in Wikipedia, and it is a tree. This wasn't apparent in the previous diagram, where the component appeared as a big ball due to the layout algorithm used.
tree.jpg


What this suggests is that Wikipedia contributors are quite capable of generating trees; it's just that not all the bits of the tree are connected (hence all the components in the previous post).
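
Checking that a component is a tree takes only a few lines. The sketch below is a generic illustration rather than the code used here: it assumes the graph is available as a list of (parent, child) name pairs, treats edges as undirected, and ignores taxa that have no links at all. The names connected_components and is_tree are made up for the example.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def connected_components(edges: Iterable[Edge]) -> List[Set[str]]:
    """Connected components of an undirected graph, largest first."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen: Set[str] = set()
    found: List[Set[str]] = []
    for start in adjacency:
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        while queue:  # breadth-first search from an unvisited node
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        found.append(component)
    return sorted(found, key=len, reverse=True)


def is_tree(component: Set[str], edges: Iterable[Edge]) -> bool:
    """A connected component is a tree when it has exactly n - 1 distinct edges."""
    distinct = {frozenset(e) for e in edges
                if e[0] in component and e[1] in component}
    return len(distinct) == len(component) - 1
```

The first element returned by connected_components is the largest component, and is_tree confirms whether it contains any cycles (for example, a taxon listed under two different parents).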

As Cyndy Parr suggested in her comments, it would be useful to compare the Wikipedia-derived tree with other trees, say from Mammal Species of the World or ITIS.