Tuesday, September 01, 2009

Wikipedia mammals and the power law

Playing a bit more with the Wikipedia mammal data, I found some interesting patterns. The first is that if we rank the mammal pages by size (here defined as the number of characters in the page source) and plot size against rank, we get a graph that looks very much like a power law:

There are a few large pages on mammals (these are on the left), and lots of small pages (the long tail on the right). If we do a log-log plot we get this:

The straight line is characteristic of a power law. The dip at the far right reflects the fact that Wikipedia pages have a minimum size (for example, they must include a Taxobox). Now, this is a bit crude (I should probably look at "Power-law distributions in empirical data" arXiv:0706.1062v2 before getting too carried away), but power laws are characteristic of the link structure of the web (a few big sites with huge numbers of links, huge numbers of sites with few links), and indeed of at least parts of Wikipedia, such as the Gene Wiki project (see doi:10.1371/journal.pbio.0060175).
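As a rough illustration of the rank-versus-size idea, here is a minimal Python sketch that ranks pages by size and fits a straight line to the log-log values by ordinary least squares. The function name `loglog_slope` and the synthetic sizes are made up for this example and are not the actual mammal data. A least-squares fit on a log-log plot is about as crude as eyeballing the straight line; the arXiv paper mentioned above recommends maximum-likelihood methods instead.

```python
import math

def loglog_slope(page_sizes):
    """Rank pages by size (rank 1 = largest) and fit log10(size) against
    log10(rank) by ordinary least squares. Returns (slope, intercept).
    Assumes every size is a positive number."""
    ordered = sorted(page_sizes, reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log10(size) for size in ordered]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic sizes that follow size = 100000 / rank exactly (illustration only).
synthetic_sizes = [100000 / rank for rank in range(1, 201)]
slope, intercept = loglog_slope(synthetic_sizes)
```

For these synthetic sizes the fitted slope comes out at essentially -1 and the intercept at 5, because the data were built that way. Real page sizes would be noisier, and the minimum page size would bend the right-hand end of the line.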

In this context, the diagrams are showing that even if mammals are "charismatic megafauna", most of them aren't that charismatic. Wikipedia mammal pages are mostly small. This raises the question of whether the high frequency with which Wikipedia mammal pages appeared at the top of Google searches might be attributed to those large pages on (presumably) charismatic mammals. If this were the case, then we'd expect that small pages wouldn't rank highly in Google searches. So, I plotted page size against Google search rank for the Wikipedia mammal pages:

This is a box plot, where each grey box spans the middle 50% of the distribution of page size for that rank (the horizontal black line is the median), and extreme values are shown as circles. Note that 0 is the highest rank (i.e., the first hit in Google), and 9 is the lowest.
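The numbers behind a box plot like this can be sketched in a few lines of Python. This assumes the data are available as (Google rank, page size) pairs; the function name `size_summary_by_rank` and the sample values are hypothetical, and the choice of the "inclusive" quartile method is mine (plotting packages differ in how they compute quartiles).

```python
from collections import defaultdict
from statistics import quantiles

def size_summary_by_rank(observations):
    """Group page sizes by Google rank (0 = first hit) and return
    {rank: (lower quartile, median, upper quartile)} for each rank."""
    sizes_by_rank = defaultdict(list)
    for rank, size in observations:
        sizes_by_rank[rank].append(size)

    summary = {}
    for rank in sorted(sizes_by_rank):
        sizes = sizes_by_rank[rank]
        if len(sizes) < 2:
            continue  # statistics.quantiles needs at least two values
        q1, median, q3 = quantiles(sizes, n=4, method="inclusive")
        summary[rank] = (q1, median, q3)
    return summary

# Made-up (rank, size) pairs, for illustration only.
sample = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (9, 10), (9, 20)]
summary = size_summary_by_rank(sample)
```

The lower and upper quartiles are the bottom and top of each grey box, and the median is the black line. If small pages really can have any rank, the lower quartiles should look similar across ranks 0 to 9.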

While, not surprisingly, most large Wikipedia pages do well in Google searches, and large pages are rarely low down the rankings, my sense is that small pages can have any rank, from top (0) to bottom (9). If page size (a crude measure of the effort put into editing a Wikipedia page) is a measure of "charisma" (contributors are more likely to edit pages on animals that lots of people know about), then charisma isn't a great predictor of where you come in Google's search results. It's not about size, it's about being in Wikipedia.