INEX Wikipedia Corpus

References

http://www.inex.otago.ac.nz/tracks/wiki-mine/wiki-mine.asp
- The INEX XML Wikipedia collection is a marked-up version of the Wikipedia corpus. The mark-up includes named entities and document structure such as document sections, tables and hyperlinks. The classification and clustering tasks use a 144,625-document subset of the INEX 2010 collection that has been pre-processed to provide various representations of the documents. Representations are available as a vector space representation of terms, frequent bi-grams, XML tags, trees, links and named entities. The collection is also available in XML format and text-only format.

http://www-connex.lip6.fr/~denoyer/wikipediaXML/
- QUOTE: We propose a XML corpus based on Wikipedia. This corpus can be used in a large variety of XML IR tasks like ad-hoc retrieval, categorization, clustering or Structure Mapping task. This corpus will be used for both, INEX 2007 and the XML Document Mining Challenge. You can find a description of the corpus in this article (published in SIGIR Forum)

(Vercoustre et al., 2008) ⇒ Anne-Marie Vercoustre, James A. Thom, and Jovan Pehcevski. (2008). “Entity Ranking in Wikipedia.” In: Proceedings of the 2008 ACM Symposium on Applied Computing.
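
Example

The first reference above mentions vector space representations built from terms and XML tags. As a rough illustration only, the sketch below derives both kinds of count vector from a single XML document using the Python standard library. The sample document, its tag names (article, title, section, p, link) and the function names are hypothetical and are not taken from the actual INEX schema or pre-processing pipeline.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document: tag names are illustrative,
# not the real INEX Wikipedia schema.
SAMPLE_XML = """<article>
  <title>Otago</title>
  <section>
    <p>Otago is a region of <link target="New_Zealand">New Zealand</link>.</p>
  </section>
</article>"""


def term_vector(xml_text):
    """Count lower-cased alphanumeric tokens in the document's text content."""
    root = ET.fromstring(xml_text)
    text = " ".join(root.itertext())
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def tag_vector(xml_text):
    """Count how often each XML tag occurs in the document."""
    root = ET.fromstring(xml_text)
    return Counter(element.tag for element in root.iter())


terms = term_vector(SAMPLE_XML)
tags = tag_vector(SAMPLE_XML)
```

Here terms maps each token to its frequency and tags does the same for element names; a real pipeline would presumably add steps such as stop-word removal and weighting, which are omitted from this sketch.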