INEX Wikipedia Corpus


An INEX Wikipedia Corpus is a Wikipedia data snapshot that is an XML marked-up version of a Wikipedia corpus (with mark-up for document structure, hyperlinks and named entities).



References

2010

  • http://www.inex.otago.ac.nz/tracks/wiki-mine/wiki-mine.asp
    • The INEX XML Wikipedia collection is a marked-up version of the Wikipedia corpus. The mark-up includes named entities and document structure such as document sections, tables and hyperlinks. The classification and clustering tasks use a 144,625-document subset of the INEX 2010 collection that has been pre-processed to provide various representations of the documents. Representations are available as a vector space representation of terms, frequent bi-grams, XML tags, trees, links and named entities. The collection is also available in XML format and text-only format.
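
To make the listed representations more concrete, here is a minimal Python sketch that derives term, bi-gram, XML tag and link counts from one small marked-up document. It is not the INEX pre-processing pipeline: the sample document, its tag names (`article`, `sec`, `p`, `link`), the `target` attribute and the `representations` helper are all assumptions made for illustration, and the real collection's schema may differ.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document. Tag and attribute names are illustrative
# assumptions, not the actual INEX Wikipedia schema.
SAMPLE = """<article>
  <title>Information retrieval</title>
  <sec>
    <p>Information retrieval finds documents. See <link target="XML">XML retrieval</link>.</p>
  </sec>
</article>"""


def representations(xml_text):
    """Return simple count-based views of one XML document."""
    root = ET.fromstring(xml_text)

    # Text-only view: concatenate every text node, then tokenize crudely.
    text = " ".join(root.itertext())
    tokens = re.findall(r"[a-z0-9]+", text.lower())

    return {
        # Bag-of-words term counts (a sparse vector space representation).
        "terms": Counter(tokens),
        # Counts of adjacent token pairs (bi-grams).
        "bigrams": Counter(zip(tokens, tokens[1:])),
        # How often each XML tag occurs, including the root element.
        "tags": Counter(el.tag for el in root.iter()),
        # Outgoing link targets, read from the assumed `target` attribute.
        "links": Counter(
            el.get("target") for el in root.iter("link") if el.get("target")
        ),
    }


reps = representations(SAMPLE)
print(reps["terms"].most_common(3))
print(reps["tags"])
```

Each view is a `Counter`, so it can be treated as a sparse feature vector for classification or clustering experiments. Frequency thresholds (for "frequent" bi-grams), tree representations and named entities are left out of this sketch.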

2007

  • http://www-connex.lip6.fr/~denoyer/wikipediaXML/
    • QUOTE: We propose a XML corpus based on Wikipedia. This corpus can be used in a large variety of XML IR tasks like ad-hoc retrieval, categorization, clustering or Structure Mapping task. This corpus will be used for both, INEX 2007 and the XML Document Mining Challenge. You can find a description of the corpus in this article (published in SIGIR Forum)

2008