Cross-Language Document Categorization Task

References

(Franco-Salvador et al., 2014) ⇒ Franco-Salvador, M., Rosso, P., & Navigli, R. (2014, April). A Knowledge-based Representation for Cross-Language Document Retrieval and Categorization. In EACL (Vol. 14, pp. 414-423).
- Abstract: Current approaches to cross-language document retrieval and categorization are based on discriminative methods which represent documents in a low-dimensional vector space. In this paper we propose a shift from the supervised to the knowledge-based paradigm and provide a document similarity measure which draws on BabelNet, a large multilingual knowledge resource. Our experiments show state-of-the-art results in cross-lingual document retrieval and categorization.

(Guo & Xiao, 2012) ⇒ Guo, Y., & Xiao, M. (2012). Cross language text classification via subspace co-regularized multi-view learning. arXiv preprint arXiv:1206.6481.
- Abstract: In many multilingual text classification problems, the documents in different languages often share the same set of categories. To reduce the labeling cost of training a classification model for each individual language, it is important to transfer the label knowledge gained from one language to another language by conducting cross language classification. In this paper we develop a novel subspace co-regularized multi-view learning method for cross language text classification. This method is built on parallel corpora produced by machine translation. It jointly minimizes the training error of each classifier in each language while penalizing the distance between the subspace representations of parallel documents. Our empirical study on a large set of cross language text classification tasks shows the proposed method consistently outperforms a number of inductive methods, domain adaptation methods, and multi-view learning methods.
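
The idea stated in this abstract (a per-language training error plus a penalty on the distance between the subspace representations of parallel documents) can be illustrated with a small sketch. This is not the formulation from Guo & Xiao (2012): the squared loss, the linear projections, and every name below (`co_regularized_objective`, `P_src`, `P_tgt`, `w_src`, `w_tgt`, `gamma`) are assumptions chosen for illustration, and the sketch only evaluates an objective value rather than optimizing it.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project(P: Sequence[Vector], x: Vector) -> List[float]:
    """Map a document vector x into the subspace via the rows of P."""
    return [dot(row, x) for row in P]


def co_regularized_objective(
    src_docs: Sequence[Vector],   # documents in the source language
    tgt_docs: Sequence[Vector],   # their parallel (translated) counterparts
    labels: Sequence[float],      # shared labels, e.g. +1.0 / -1.0
    P_src: Sequence[Vector],      # hypothetical source-language projection
    P_tgt: Sequence[Vector],      # hypothetical target-language projection
    w_src: Vector,                # hypothetical source-language classifier
    w_tgt: Vector,                # hypothetical target-language classifier
    gamma: float,                 # weight of the co-regularization penalty
) -> float:
    """Sum of both views' squared errors plus the subspace-distance penalty."""
    total = 0.0
    for xs, xt, y in zip(src_docs, tgt_docs, labels):
        zs = project(P_src, xs)
        zt = project(P_tgt, xt)
        # Training error of each language-specific classifier (assumed squared loss).
        total += (dot(w_src, zs) - y) ** 2
        total += (dot(w_tgt, zt) - y) ** 2
        # Penalty on the distance between the two subspace representations.
        total += gamma * sum((a - b) ** 2 for a, b in zip(zs, zt))
    return total
```

With `gamma` set to zero the two classifiers are scored independently; larger values favor projections that map parallel documents to nearby points in the shared subspace.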

(Sammut & Webb, 2011) ⇒ Claude Sammut (editor), and Geoffrey I. Webb (editor). (2011). “Cross-Language Document Categorization.” In: (Sammut & Webb, 2011) p.242
- Document categorization is the task of assigning a document to zero, one, or more categories in a predefined taxonomy. Cross-language document categorization is the specific case in which documents must be automatically categorized into the same taxonomy regardless of which of several languages they are written in. For more details on the methods used to perform this task, see cross-lingual text mining.
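
The task definition above can be made concrete with a toy sketch: documents in different languages are mapped to language-independent concepts and then assigned to zero, one, or more categories of a single shared taxonomy. The lexicons, concept labels, category names, and the `categorize` function are all hypothetical and hand-written for illustration; real systems rely on the methods surveyed under cross-lingual text mining.

```python
import re
from typing import Dict, Set

# Hypothetical toy lexicons: surface word -> language-independent concept.
LEXICON: Dict[str, Dict[str, str]] = {
    "en": {"goal": "SOCCER", "match": "SOCCER",
           "election": "VOTE", "parliament": "VOTE"},
    "es": {"gol": "SOCCER", "partido": "SOCCER",
           "elecciones": "VOTE", "parlamento": "VOTE"},
}

# One shared taxonomy: category -> concepts that signal it.
TAXONOMY: Dict[str, Set[str]] = {
    "sports": {"SOCCER"},
    "politics": {"VOTE"},
}


def categorize(text: str, lang: str) -> Set[str]:
    """Return every taxonomy category whose cue concepts occur in the text."""
    lexicon = LEXICON[lang]
    tokens = re.findall(r"\w+", text.lower())
    # Map the words found in the lexicon to language-independent concepts.
    concepts = {lexicon[t] for t in tokens if t in lexicon}
    # A document may receive zero, one, or several categories.
    return {cat for cat, cues in TAXONOMY.items() if concepts & cues}


print(categorize("The match ended after a late goal.", "en"))
print(categorize("El partido terminó con un gol tras las elecciones.", "es"))
print(categorize("Nothing relevant here.", "en"))
```

The English document receives only `sports`, the Spanish one receives both `sports` and `politics`, and the last document receives no category, which shows the same taxonomy being applied independently of the document's language.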