Using 80legs for competitive intelligence

This very good writeup of 80legs makes it sound like it might be useful for competitive intelligence. 80legs offers cloud-based processing of data from massive web crawls; you tell it which websites to crawl and what to look for, and 80legs does this in the cloud "overnight". This sounds a bit similar to something Aqute has done a few times, for example executing searches like this one across a lot of Google search results (using the Google API).

80legs does seem quite simple to use, and at least for now it is possible to set up and run queries without paying. A simple test query (counting the number of times the word "green" appears on the BMW and Toyota websites) took 13 hours to run, though it required only 0.6 hours of CPU time spread across that period. The results were as follows:

  • 80legs crawled around 39,500 web pages across the two websites: 6,500 BMW pages and 33,000 Toyota pages.
  • 81 BMW pages (1%) included the word "green", vs 2,711 Toyota pages (8%).

It would seem Toyota is 8x greener than BMW. Of course, the purpose of the exercise was to test 80legs: the actual results in this example are a little facile.
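The counting step itself is easy to reproduce locally. The sketch below (a hypothetical stand-in, not the 80legs API, which submits crawl jobs to their cloud service) shows the kind of per-page analysis the test query performed: given fetched HTML, count pages containing a word and total occurrences.

```python
import re

def count_word(html: str, word: str) -> int:
    """Count case-insensitive whole-word occurrences of `word` in a page's text."""
    # Strip tags crudely so markup (class names, attributes) does not inflate the count.
    text = re.sub(r"<[^>]+>", " ", html)
    return len(re.findall(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE))

def summarise(pages: dict, word: str) -> tuple:
    """Return (number of pages containing word, total occurrences) across a crawl."""
    hits = [count_word(html, word) for html in pages.values()]
    return sum(1 for h in hits if h > 0), sum(hits)

# Toy stand-in for crawl output: URL -> raw HTML (a real run would fetch these).
pages = {
    "http://example.com/a": "<p>Our green cars are green.</p>",
    "http://example.com/b": "<p>No mention here.</p>",
}
print(summarise(pages, "green"))  # → (1, 2)
```

The value 80legs adds over a script like this is the crawling itself at scale; the analysis applied to each page is typically this simple.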

The conclusion? 80legs seems interesting. Some of the results did not tally with the actual web pages: some pages reported as containing the word "green" did not, and pages reported to contain it ten times only contained it three times. This may be partly down to our own inexperience with the tool, however. The time taken was also surprising: the original review touts 80legs as suitable for processing millions of Facebook records, which could take a very long time if 80legs needs 13 hours to process 50,000 web pages. Still, massive cloud-based analysis could be very useful, and 80legs deserves to be tested and given a chance to impress.