Common Crawl Dataset

A Common Crawl Dataset is a web snapshot created by the Common Crawl Foundation.

Context:
- It can (typically) be created by a Web Crawler (based on Nutch).
- It can (typically) be a member of the Common Crawl Corpus.
- It can be used to derive a Web Data Commons Dataset (such as the Web Data Commons hyperlink graph).
Example(s):
- s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-35/
- s3://aws-publicdatasets/common-crawl/crawl-002/, with 5+ billion webpage records.
- …
Counter-Example(s):
- an Archive.org dataset.
- a Wikipedia Snapshot.
See: Web Data Commons.

References

2014

http://commoncrawl.org/the-data/
- The Common Crawl corpus contains petabytes of data collected over 7 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services' Public Data Sets and on multiple academic cloud platforms across the world.
  Access to the Common Crawl corpus hosted by Amazon is free. You may use Amazon's cloud platform to run analysis jobs directly against it or you can download parts or all of it.
http://commoncrawl.org/the-data/get-started/
- The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program. From Public Data Sets, you can download the files entirely free using HTTP or S3.
  As the Common Crawl Foundation has evolved over the years, so has the format and metadata that accompany the crawls themselves.

   [ARC]  s3://aws-publicdatasets/common-crawl/crawl-001/ - Crawl #1 (2008/2009)
   [ARC]  s3://aws-publicdatasets/common-crawl/crawl-002/ - Crawl #2 (2009/2010)
   [ARC]  s3://aws-publicdatasets/common-crawl/parse-output/ - Crawl #3 (2012)
   [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/ - Summer 2013
   [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48/ - Winter 2013
   [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-10/ - March 2014
   [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-15/ - April 2014
   [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-23/ - July 2014
   [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-35/ - August 2014
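The locations listed above can be taken apart programmatically. The following is a minimal sketch, not part of the Common Crawl tooling: the names `CrawlLocation`, `parse_crawl_uri`, and `to_http_url` are hypothetical, and reading the two numbers in an identifier such as `CC-MAIN-2014-35` as a year and a week number is an assumption the excerpt does not confirm. The HTTP form uses the generic virtual-hosted-style S3 address; whether a given bucket still serves requests at that address is not something this sketch checks.

```python
import re
from typing import NamedTuple, Optional
from urllib.parse import urlparse


class CrawlLocation(NamedTuple):
    """Parts of an S3 location taken from the listing (hypothetical helper type)."""
    bucket: str
    key: str
    crawl_id: Optional[str]  # e.g. "CC-MAIN-2014-35"; None for the older ARC crawls
    year: Optional[int]      # assumption: first number in the crawl id is a year
    week: Optional[int]      # assumption: second number is a week of that year


# Matches identifiers of the form CC-MAIN-YYYY-NN, as seen in the listing.
_CRAWL_ID = re.compile(r"CC-MAIN-(\d{4})-(\d{2})")


def parse_crawl_uri(uri: str) -> CrawlLocation:
    """Split an s3:// URI into bucket and key, and pick out a crawl id if present."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    key = parsed.path.lstrip("/")
    match = _CRAWL_ID.search(key)
    if match is None:
        return CrawlLocation(parsed.netloc, key, None, None, None)
    return CrawlLocation(
        parsed.netloc, key, match.group(0), int(match.group(1)), int(match.group(2))
    )


def to_http_url(location: CrawlLocation) -> str:
    """Build the generic virtual-hosted-style HTTPS address for an S3 object or prefix."""
    return f"https://{location.bucket}.s3.amazonaws.com/{location.key}"


if __name__ == "__main__":
    loc = parse_crawl_uri(
        "s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-35/"
    )
    print(loc.crawl_id, loc.year, loc.week)
    print(to_http_url(loc))
```

For the ARC-era locations (for example `crawl-002/`), the sketch returns the bucket and key only, since those paths carry no `CC-MAIN` identifier.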

- For all crawls since 2013, the data has been stored in the WARC file format and also contains metadata (WAT) and text data (WET) extracts. Since the April 2014 crawl, we also provide file path lists for the segments, WARC, WAT, and WET files.
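A file path list of the kind mentioned above could be consumed roughly as follows. This is a sketch under stated assumptions: it assumes each list is gzip-compressed text with one object key per line, and that WARC, WAT, and WET files can be told apart by the suffixes `.warc.gz`, `.warc.wat.gz`, and `.warc.wet.gz`. Neither detail is stated in the excerpt. The function names `iter_path_list` and `file_kind` are hypothetical.

```python
import gzip
import io
from typing import Iterator


def iter_path_list(compressed: bytes, bucket: str = "aws-publicdatasets") -> Iterator[str]:
    """Yield full s3:// URIs from a gzip-compressed path list.

    Assumption: the list is UTF-8 text with one object key per line.
    """
    with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            key = line.strip()
            if key:  # skip blank lines
                yield f"s3://{bucket}/{key}"


def file_kind(path: str) -> str:
    """Classify a path as WARC, WAT, or WET by its suffix (assumed naming scheme)."""
    if path.endswith(".warc.wat.gz"):
        return "WAT"
    if path.endswith(".warc.wet.gz"):
        return "WET"
    if path.endswith(".warc.gz"):
        return "WARC"
    return "unknown"
```

The function takes the compressed bytes rather than a URL, so the download step (HTTP or S3) stays separate and the parsing can be exercised offline.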
