Tuesday, September 01, 2009

Google, Wikipedia, and EOL

One assumption I've been making so far is that when people search for information on an organism using its scientific name, Wikipedia will dominate the search results (see my earlier post for an example of this assumption). I've decided to quantify this by doing a little experiment. I grabbed the Mammal Species of the World taxonomy and extracted the 5416 species names. I then used Google's AJAX search API to look up each name in Google. For each search I took the top 10 hits and recorded for each hit the site URL and the rank in the search results (i.e., 1-10). Below is a table of how many mammal species had a hit in the top 10 Google results (showing just the top 20 most frequent sites).
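For anyone wanting to repeat the tally, here is a minimal sketch of the counting step in Python. It assumes the search results have already been collected into a dictionary mapping each species name to its ranked list of hit URLs. The dictionary layout, the function names, and the placeholder species and hosts are all made up for illustration; this is not the script I used, and it makes no calls to the Google API.

```python
from collections import Counter
from urllib.parse import urlparse


def site_of(url):
    """Return the lower-cased host name of a URL."""
    return urlparse(url).netloc.lower()


def species_per_site(results, top_n=10):
    """Count, for each site, how many species have at least one hit
    from that site among their top_n results.

    `results` maps a species name to its ranked list of hit URLs
    (an assumed layout, not a format from any real API).
    """
    counts = Counter()
    for urls in results.values():
        # Use a set so a site with several hits for one species counts once.
        sites = {site_of(u) for u in urls[:top_n]}
        counts.update(sites)
    return counts


# Placeholder data: invented names and hosts, not real search results.
sample_results = {
    "Aus bus": [
        "http://site-a.example/wiki/Aus_bus",
        "http://site-a.example/wiki/Aus",
        "http://site-b.example/taxon/1",
    ],
    "Aus cus": [
        "http://site-a.example/wiki/Aus_cus",
        "http://site-c.example/x",
    ],
    "Dus eus": [
        "http://site-b.example/taxon/2",
    ],
}

counts = species_per_site(sample_results)
for site, n in counts.most_common(20):
    share = 100 * n / len(sample_results)
    print(f"{site}\t{n}\t{share:.0f}%")
```

Counting each site once per species is the design choice that matters here: the table is about how many species a site turns up for, not how many hits it gets in total.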

Wikipedia is the clear winner, with 5266 (97%) of mammals having a Wikipedia page in the top ten Google results. Next comes Wikispecies, then Animal Diversity Web, Wikimedia Commons, ITIS, the Comparative Toxicogenomics Database, BioOne, UniProt (derived from the NCBI taxonomy), and so on. Note that the Encyclopedia of Life comes in 17th.

Things get more interesting if we look at the ranking of search results. The graph below plots the cumulative rank of search results for some of the web sites listed above.

Wikipedia dominates things. For 48% of all mammal species Wikipedia is the first result returned by Google. For just under three quarters of all mammal species, Wikipedia is either the first or second hit in Google. The next best sites are Animal Diversity Web and Wikispecies, which get a small share of first place for some species (19% and 7% respectively). Note that EOL pages manage to make it into the top 10 for only 11% of all mammal species.
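The cumulative figures can be computed along the following lines. Again this is only a sketch under the same assumption as before (a dictionary mapping each species name to its ranked list of hit URLs), with hypothetical function names and placeholder data. For a given site it returns, for each rank k, the fraction of species whose best hit from that site is at rank k or better.

```python
from urllib.parse import urlparse


def best_rank(urls, site, top_n=10):
    """Return the 1-based rank of the first hit from `site`, or None."""
    for rank, url in enumerate(urls[:top_n], start=1):
        if urlparse(url).netloc.lower() == site:
            return rank
    return None


def cumulative_share(results, site, top_n=10):
    """Fraction of species with a hit from `site` at rank <= k, for k = 1..top_n."""
    total = len(results)
    if total == 0:
        return [0.0] * top_n
    hits_at_rank = [0] * top_n
    for urls in results.values():
        rank = best_rank(urls, site, top_n)
        if rank is not None:
            hits_at_rank[rank - 1] += 1
    shares = []
    running = 0
    for n in hits_at_rank:
        running += n
        shares.append(running / total)
    return shares


# Placeholder data: invented names and hosts, not real search results.
ranked_results = {
    "Aus bus": ["http://site-a.example/1", "http://site-b.example/1"],
    "Aus cus": ["http://site-a.example/2"],
    "Dus eus": ["http://site-b.example/3", "http://site-a.example/3"],
    "Dus fus": ["http://site-b.example/4"],
}

shares = cumulative_share(ranked_results, "site-a.example", top_n=3)
print(shares)
```

The first value in the list corresponds to the "first result" share, and the second to "first or second".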

What does this all mean? Well, it seems clear that if people are using Google to find information about an organism, then Wikipedia is more likely than anything else to be the first result they see. It is also interesting that for all the energy (and funds) being expended on biodiversity databases (doi:10.1126/science.324_1632), ITIS is the only classical biodiversity database that routinely gets found in these searches (albeit in only a quarter of the searches).

I know I tend to go on a bit about EOL, but if I were running (or funding) EOL, I'd be worried. EOL barely figures in these search results, and is being taken to the cleaners by a volunteer effort (Wikipedia). Furthermore, it seems difficult to envisage what EOL can do to improve things. Sure it can link to (and make use of) content in sites such as Animal Diversity Web, ITIS (and maybe even, gasp, Wikipedia), but that just adds "link love" to those sites. Ironically, perhaps the single thing that would improve EOL's ranking would be if Wikipedia spread some of its link love over EOL, by linking all its taxon pages to the corresponding EOL page.

But there are bigger issues at stake. Site popularity on the web tends to follow a power law, where a very few web sites grab the vast majority of eyeballs. In an old blog post Clay Shirky wrote:

Now, thanks to a series of breakthroughs in network theory by researchers ... we know that power law distributions tend to arise in social systems where many people express their preferences among many options. We also know that as the number of options rise, the curve becomes more extreme. This is a counter-intuitive finding - most of us would expect a rising number of choices to flatten the curve, but in fact, increasing the size of the system increases the gap between the #1 spot and the median spot.

So, creating new and improved biodiversity web sites is likely to have the effect of only increasing the gap between Wikipedia and the rest.

Lastly, as I've mentioned before regarding Wikipedia and citations of taxonomic work, the graph above suggests to me that for anybody wanting to make basic biodiversity information available on the web, and attract readers to basic taxonomic literature, there really is only one game in town.