Google Research's WikiLinks Dataset
Google Research's WikiLinks Dataset is an Annotated Dataset of in-links to Wikipedia Entity Pages.
References
2013
- (Orr, Subramanya & Pereira, 2013) ⇒ Dave Orr, Amar Subramanya, and Fernando Pereira. (2013). “Learning from Big Data: 40 Million Entities in Context.” In: Google Research Blog
- When someone mentions Mercury, are they talking about the planet, the god, the car, the element, Freddie, or one of some 89 other possibilities? This problem is called disambiguation (a word that is itself ambiguous), and while it’s necessary for communication, and humans are amazingly good at it (when was the last time you confused a fruit with a giant tech company?), computers need help.
To provide that help, we are releasing the Wikilinks Corpus: 40 million total disambiguated mentions within over 10 million web pages -- over 100 times bigger than the next largest corpus (about 100,000 documents, see the table below for mention and entity counts). The mentions are found by looking for links to Wikipedia pages where the anchor text of the link closely matches the title of the target Wikipedia page. If we think of each page on Wikipedia as an entity (an idea we’ve discussed before), then the anchor text can be thought of as a mention of the corresponding entity.
- http://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/wiki-links/README.txt
- QUOTE: The Wikipedia links (WikiLinks) data consists of web pages that contain at least one hyperlink that points to English Wikipedia. The data set was obtained by iterating over Google's web index. We treat each page on Wikipedia as representing an entity (or concept or idea), and the anchor text as a mention of that entity. We have done some filtering to ensure that the anchor text can be a mention of the entity that it links to (e.g., we remove anchors such as …
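The extraction idea described above (hyperlinks into English Wikipedia, kept only when the anchor text closely matches the target page's title) can be sketched in a few lines of Python. The matching heuristic and class below are illustrative assumptions, not Google's actual pipeline:

# Minimal sketch: collect anchor-text mentions that link to English Wikipedia
# from one HTML page, keeping only anchors whose text loosely matches the title.
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse

class WikiAnchorCollector(HTMLParser):
    """Collects (anchor_text, wikipedia_title) mention pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self._target = None
        self._text = []
        self.mentions = []   # list of (anchor_text, target_title)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            parsed = urlparse(href)
            if parsed.netloc.endswith("en.wikipedia.org") and parsed.path.startswith("/wiki/"):
                self._target = unquote(parsed.path[len("/wiki/"):]).replace("_", " ")
                self._text = []

    def handle_data(self, data):
        if self._target is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._target is not None:
            anchor = " ".join("".join(self._text).split())
            # crude stand-in for "anchor text closely matches the page title":
            # keep the pair only if anchor and title share at least one word
            if set(anchor.lower().split()) & set(self._target.lower().split()):
                self.mentions.append((anchor, self._target))
            self._target = None

collector = WikiAnchorCollector()
collector.feed('<p>See <a href="http://en.wikipedia.org/wiki/Mercury_(planet)">Mercury</a> for details.</p>')
print(collector.mentions)   # [('Mercury', 'Mercury (planet)')]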
2012
- http://www.iesl.cs.umass.edu/data/wiki-links
- Cross-document coreference resolution is the task of grouping the entity mentions in a collection of documents into sets that each represent a distinct entity. It is central to knowledge base construction and also useful for joint inference with other NLP components. Obtaining large, organic labeled datasets for training and testing cross-document coreference has previously been difficult. We use a method for automatically gathering massive amounts of naturally-occurring cross-document reference data to create the Wikilinks dataset comprising of 40 million mentions over 3 million entities. Our method is based on finding hyperlinks to Wikipedia from a web crawl and using anchor text as mentions. In addition to providing large-scale labeled data without human effort, we are able to include many styles of text beyond newswire and many entity types beyond people.
... - Number of Mentions: 40,323,863
- Number of Entities: 2,933,659
- Number of pages: 10,893,248
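The hyperlink targets are what make the corpus usable as cross-document coreference labels: mentions from different pages that link to the same Wikipedia page fall into the same entity cluster. A minimal Python sketch of that grouping follows, using made-up records whose field names are illustrative rather than the dataset's actual schema:

from collections import defaultdict

# Hand-made mention records; "doc", "anchor", and "target" are illustrative names.
mentions = [
    {"doc": "page-1.html", "anchor": "Mercury",         "target": "Mercury_(planet)"},
    {"doc": "page-2.html", "anchor": "planet Mercury",  "target": "Mercury_(planet)"},
    {"doc": "page-3.html", "anchor": "Freddie Mercury", "target": "Freddie_Mercury"},
]

# Mentions from different documents that share a target form one entity cluster.
clusters = defaultdict(list)
for m in mentions:
    clusters[m["target"]].append((m["doc"], m["anchor"]))

for entity, members in sorted(clusters.items()):
    print(entity, "->", members)
# Freddie_Mercury -> [('page-3.html', 'Freddie Mercury')]
# Mercury_(planet) -> [('page-1.html', 'Mercury'), ('page-2.html', 'planet Mercury')]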
$ awk '{print $1}' data-00003-of-00010 | sort | uniq -c
 2,175,992
 4,049,295 MENTION
10,911,908 TOKEN
 1,087,996 URL
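The tally above suggests a line-oriented shard format: records that begin with a URL line, followed by MENTION and TOKEN lines, separated by blank lines. Below is a hedged Python sketch for reading a shard under that assumption; the field order after each line-type tag is not guaranteed here:

# Assumed layout: each record starts with a URL line, followed by tab-separated
# MENTION and TOKEN lines, with a blank line between records (this matches the
# line-type tally above, but the per-line field order is an assumption).
def read_records(path):
    record = None
    with open(path, encoding="utf-8", errors="replace") as f:
        for raw in f:
            line = raw.rstrip("\n")
            if not line:                 # blank line closes the current record
                if record is not None:
                    yield record
                record = None
                continue
            tag, _, rest = line.partition("\t")
            if tag == "URL":
                record = {"url": rest, "mentions": [], "tokens": []}
            elif tag == "MENTION" and record is not None:
                record["mentions"].append(rest.split("\t"))
            elif tag == "TOKEN" and record is not None:
                record["tokens"].append(rest.split("\t"))
    if record is not None:
        yield record

# Example: per-page mention counts for one shard.
# for rec in read_records("data-00003-of-00010"):
#     print(rec["url"], len(rec["mentions"]))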