Google Research's WikiLinks Dataset
Google Research's WikiLinks Dataset is an Annotated Dataset of in-links to Wikipedia Entity Pages.
References
2013
- (Orr, Subramanya & Pereira, 2013) ⇒ Dave Orr, Amar Subramanya, and Fernando Pereira. (2013). “Learning from Big Data: 40 Million Entities in Context.” In: Google Research Blog
- When someone mentions Mercury, are they talking about the planet, the god, the car, the element, Freddie, or one of some 89 other possibilities? This problem is called disambiguation (a word that is itself ambiguous), and while it’s necessary for communication, and humans are amazingly good at it (when was the last time you confused a fruit with a giant tech company?), computers need help.
To provide that help, we are releasing the Wikilinks Corpus: 40 million total disambiguated mentions within over 10 million web pages -- over 100 times bigger than the next largest corpus (about 100,000 documents, see the table below for mention and entity counts). The mentions are found by looking for links to Wikipedia pages where the anchor text of the link closely matches the title of the target Wikipedia page. If we think of each page on Wikipedia as an entity (an idea we’ve discussed before), then the anchor text can be thought of as a mention of the corresponding entity.
- http://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/wiki-links/README.txt
- QUOTE: The Wikipedia links (WikiLinks) data consists of web pages that contain at least one hyperlink that points to English Wikipedia. The data set was obtained by iterating over Google's web index. We treat each page on Wikipedia as representing an entity (or concept or idea), and the anchor text as a mention of that entity. We have done some filtering to ensure that the anchor text can be a mention of the entity that it links to (e.g., we remove anchors such as …
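The extraction idea described above (hyperlinks into English Wikipedia, kept only when the anchor text closely matches the target page's title) can be sketched in a few lines of Python. The matching heuristic and class below are illustrative assumptions, not Google's actual pipeline:

# Minimal sketch: collect anchor-text mentions that link to English Wikipedia
# from one HTML page, keeping only anchors whose text loosely matches the title.
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse

class WikiAnchorCollector(HTMLParser):
    """Collects (anchor_text, wikipedia_title) mention pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self._target = None
        self._text = []
        self.mentions = []   # list of (anchor_text, target_title)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            parsed = urlparse(href)
            if parsed.netloc.endswith("en.wikipedia.org") and parsed.path.startswith("/wiki/"):
                self._target = unquote(parsed.path[len("/wiki/"):]).replace("_", " ")
                self._text = []

    def handle_data(self, data):
        if self._target is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._target is not None:
            anchor = " ".join("".join(self._text).split())
            # crude stand-in for "anchor text closely matches the page title":
            # keep the pair only if anchor and title share at least one word
            if set(anchor.lower().split()) & set(self._target.lower().split()):
                self.mentions.append((anchor, self._target))
            self._target = None

collector = WikiAnchorCollector()
collector.feed('<p>See <a href="http://en.wikipedia.org/wiki/Mercury_(planet)">Mercury</a> for details.</p>')
print(collector.mentions)   # [('Mercury', 'Mercury (planet)')]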
2012
- http://www.iesl.cs.umass.edu/data/wiki-links
- Cross-document coreference resolution is the task of grouping the entity mentions in a collection of documents into sets that each represent a distinct entity. It is central to knowledge base construction and also useful for joint inference with other NLP components. Obtaining large, organic labeled datasets for training and testing cross-document coreference has previously been difficult. We use a method for automatically gathering massive amounts of naturally-occurring cross-document reference data to create the Wikilinks dataset comprising of 40 million mentions over 3 million entities. Our method is based on finding hyperlinks to Wikipedia from a web crawl and using anchor text as mentions. In addition to providing large-scale labeled data without human effort, we are able to include many styles of text beyond newswire and many entity types beyond people.
... - Number of Mentions: 40,323,863
- Number of Entities: 2,933,659
- Number of pages: 10,893,248
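The hyperlink targets are what make the corpus usable as cross-document coreference labels: mentions from different pages that link to the same Wikipedia page fall into the same entity cluster. A minimal Python sketch of that grouping follows, using made-up records whose field names are illustrative rather than the dataset's actual schema:

from collections import defaultdict

# Hand-made mention records; "doc", "anchor", and "target" are illustrative names.
mentions = [
    {"doc": "page-1.html", "anchor": "Mercury",         "target": "Mercury_(planet)"},
    {"doc": "page-2.html", "anchor": "planet Mercury",  "target": "Mercury_(planet)"},
    {"doc": "page-3.html", "anchor": "Freddie Mercury", "target": "Freddie_Mercury"},
]

# Mentions from different documents that share a target form one entity cluster.
clusters = defaultdict(list)
for m in mentions:
    clusters[m["target"]].append((m["doc"], m["anchor"]))

for entity, members in sorted(clusters.items()):
    print(entity, "->", members)
# Freddie_Mercury -> [('page-3.html', 'Freddie Mercury')]
# Mercury_(planet) -> [('page-1.html', 'Mercury'), ('page-2.html', 'planet Mercury')]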
$ awk '{print $1}' data-00003-of-00010 | sort | uniq -c
 2,175,992
 4,049,295 MENTION
10,911,908 TOKEN
 1,087,996 URL
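The tally above suggests a line-oriented shard format: records that begin with a URL line, followed by MENTION and TOKEN lines, separated by blank lines. Below is a hedged Python sketch for reading a shard under that assumption; the field order after each line-type tag is not guaranteed here:

# Assumed layout: each record starts with a URL line, followed by tab-separated
# MENTION and TOKEN lines, with a blank line between records (this matches the
# line-type tally above, but the per-line field order is an assumption).
def read_records(path):
    record = None
    with open(path, encoding="utf-8", errors="replace") as f:
        for raw in f:
            line = raw.rstrip("\n")
            if not line:                 # blank line closes the current record
                if record is not None:
                    yield record
                record = None
                continue
            tag, _, rest = line.partition("\t")
            if tag == "URL":
                record = {"url": rest, "mentions": [], "tokens": []}
            elif tag == "MENTION" and record is not None:
                record["mentions"].append(rest.split("\t"))
            elif tag == "TOKEN" and record is not None:
                record["tokens"].append(rest.split("\t"))
    if record is not None:
        yield record

# Example: per-page mention counts for one shard.
# for rec in read_records("data-00003-of-00010"):
#     print(rec["url"], len(rec["mentions"]))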