Text Language Classification Task

AKA: Language Identification.
Context:
- It can be solved by a Text Language Classification System.
Example(s):
- f(“This sentence is written in English. The entire passage as well.”) ⇒ English Language.
- f(“Dieser Satz ist in deutscher Sprache.”) ⇒ German Language.
- f(“Dieser Satz ist in deutscher Sprache. The entire passage is not.”) ⇒ Mixed Language.
- …
Counter-Example(s):
- a Sentiment Classification Task.
See: Natural Language Processing Task.

References

http://en.wikipedia.org/wiki/Language_identification
- Language identification is the process of determining which natural language given content is in. Traditionally, identification of written language - as practiced, for instance, in library science - has relied on manually identifying frequent words and letters known to be characteristic of particular languages. More recently, computational approaches have been applied to the problem, by viewing language identification as a special case of text categorization, a Natural Language Processing approach that relies on statistical methods.
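
The frequent-word heuristic described in this passage can be sketched in a few lines of Python. This is only an illustration: the word lists, the function name `identify_by_frequent_words`, and the "Unknown" fallback are assumptions made for the example and do not come from the quoted source.

```python
import re

# Hypothetical, hand-picked lists of frequent words (assumption for illustration);
# a realistic system would need much larger lists and many more languages.
FREQUENT_WORDS = {
    "English": {"the", "is", "in", "and", "of", "to", "as", "this", "not"},
    "German": {"der", "die", "das", "ist", "in", "und", "dieser", "nicht", "ein"},
}


def identify_by_frequent_words(text):
    """Return the language whose frequent words occur most often in the text."""
    # Lowercase the text and keep runs of letters only (Unicode-aware).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in FREQUENT_WORDS.items()
    }
    best = max(scores, key=scores.get)
    # If no frequent word matched at all, refuse to guess.
    return best if scores[best] > 0 else "Unknown"


print(identify_by_frequent_words("Dieser Satz ist in deutscher Sprache."))
```

A word-list approach like this tends to be unreliable on very short inputs, which is one reason statistical character-level methods are often preferred.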

http://alias-i.com/lingpipe/demos/tutorial/langid/read-me.html
- QUOTE: Language identification is the problem of classifying a sample of characters based on its language. This is a critical pre-processing stage in many applications that apply language-specific modeling. For instance, a search engine might use different tokenizers based on the language being stored.

(Cavnar & Trenkle, 1994) ⇒ William B. Cavnar, and John M. Trenkle. (1994). “N-gram-based Text Categorization.” In: Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval.
- QUOTE: We describe here an N-gram-based approach to text categorization that is tolerant of textual errors. The system is small, fast and robust. This system worked very well for language classification, achieving in one test a 99.8% correct classification rate on Usenet newsgroup articles written in different languages.
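
An N-gram-based language classifier of this general kind can be sketched as follows. This is a minimal sketch, not the authors' implementation: it builds a ranked character n-gram profile per language and picks the language whose profile has the smallest rank-displacement ("out-of-place") distance from the document's profile. The training samples, the n-gram lengths (1 to 3), the profile size, and all function names are assumptions chosen for the example.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank (0 = most frequent)."""
    # Normalize whitespace and pad with blanks so word boundaries become n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ranked[:top_k])}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed maximum penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def classify(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# Tiny illustrative training samples (assumption); real profiles need far more text.
SAMPLES = {
    "English": "This sentence is written in English. The entire passage is in English as well.",
    "German": "Dieser Satz ist in deutscher Sprache geschrieben. Der gesamte Absatz ist ebenfalls auf Deutsch.",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

print(classify("Dieser Satz ist in deutscher Sprache.", PROFILES))
```

Because the profiles are built from character n-grams rather than whole words, a misspelled or garbled word still contributes mostly correct n-grams, which is consistent with the tolerance to textual errors that the quote mentions. With training samples this small, results on unseen text should not be trusted.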