Web Graph Topology
A Web Graph Topology is a graph topology for a Web snapshot.
- See: WDC Hyperlink.
References
2014
- http://webdatacommons.org/hyperlinkgraph/2014-04/topology.html
- This document provides basic statistics about the topology of the Web Data Commons - Hyperlink Graph extracted from the Common Crawl Corpus released in April 2014, covering 1.7 billion web pages and 64 billion hyperlinks between these pages. The graph was extracted from the Spring 2014 web corpus of the Common Crawl Foundation. This corpus was gathered using a modified Apache Nutch crawler that fetched pages from a seed list without discovering new links while crawling. The seed list, containing around 6 billion URLs, was provided by the search engine company blekko; the crawl corpus contains around 2 billion of those pages. The Common Crawl Foundation and blekko started a cooperation in 2013 to increase the quality and popularity of the crawled pages and to reduce the number of crawled spam pages and crawler traps.
Although the 2014 graph is well connected (91% of all nodes), the largest strongly connected component consists of only 19% of all nodes, which is most likely due to the selected crawling strategy. We nevertheless encourage the use of this graph dataset, but for analyses of the connectivity of pages within the Web and for structural studies we recommend the 2012 graph dataset, as a BFS-based selection strategy that includes link discovery while crawling most likely yields a more realistic sample of the structure of the Web.
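Component figures like the ones quoted above can be reproduced on a small extract of such a hyperlink graph with standard graph tooling. The following is a minimal sketch, not code from the WDC project: it assumes a hypothetical whitespace-separated edge-list file sample-edges.txt (one "source target" pair of page IDs per line) and uses the NetworkX library; the full 64-billion-edge graph would of course not fit in memory this way.
```python
# Sketch: fraction of nodes in the largest weakly/strongly connected component
# of a small hyperlink-graph sample (assumed edge-list format, not the WDC files).
import networkx as nx

# Hyperlinks are directed: an edge points from the source page to the target page.
G = nx.read_edgelist("sample-edges.txt", create_using=nx.DiGraph, nodetype=int)

n = G.number_of_nodes()

# Largest weakly connected component: connectivity when link direction is ignored.
largest_wcc = max(nx.weakly_connected_components(G), key=len)

# Largest strongly connected component: every page reaches every other page
# by following links in their actual direction.
largest_scc = max(nx.strongly_connected_components(G), key=len)

print(f"Nodes: {n}")
print(f"Largest WCC: {len(largest_wcc) / n:.1%} of nodes")
print(f"Largest SCC: {len(largest_scc) / n:.1%} of nodes")
```
On a sample with the properties described above, the weakly connected share would be far larger than the strongly connected share, reflecting a crawl that fetched seed pages without following their outlinks.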