Terminology Extraction Task
A terminology extraction task is an information extraction task that is restricted to the population of some portion of a terminology database.
- AKA: Technical Term Acquisition/Mining.
- Context:
- Input: a corpus.
- Optional: an information extraction pattern.
- Optional: a Threshold, e.g. on the number of terminological units to report.
- Output: a Terminological Unit Record set (of Candidate Terms).
- It can be solved by a Terminology Extraction System (that implements a Terminology Extraction algorithm).
- It can be supported by a Term Recognition Task (or by its component Term Detection Task).
- It can be supported by a Synonymous Terms Extraction Task.
- It can range from being a Domain-Specific Terminology Extraction Task to being an Open Terminology Extraction Task.
- It can support:
- It can (typically) be:
- an Unstructured Information Extraction Task.
- an Entity Extraction Task, if the focus is on simple Technical Term Records.
- Example(s):
- identify the main terminological units in the kdd09cma1 corpus.
- a Synonym Extraction Task.
- a Collocation Extraction Task.
- …
- Counter-Example(s):
- See: Technical Terminology, CompuTerm Workshop, Terminological Analysis.
References
2012
- http://en.wikipedia.org/wiki/Terminology_extraction
- Terminology mining, term extraction, term recognition, or glossary extraction, is a subtask of information extraction. The goal of terminology extraction is to automatically extract relevant terms from a given corpus.
In the semantic web era, a growing number of communities and networked enterprises started to access and interoperate through the internet. Modeling these communities and their information needs is important for several web applications, like topic-driven web crawlers,[1] web services,[2] recommender systems,[3] etc. The development of terminology extraction is essential to the language industry.
One of the first steps to model the knowledge domain of a virtual community is to collect a vocabulary of domain-relevant terms, constituting the linguistic surface manifestation of domain concepts. Several methods to automatically extract technical terms from domain-specific document warehouses have been described in the literature.
Typically, approaches to automatic term extraction make use of linguistic processors (part of speech tagging, phrase chunking) to extract terminological candidates, i.e. syntactically plausible terminological noun phrases, NPs (e.g. compounds "credit card", adjective-NPs "local tourist information office", and prepositional-NPs "board of directors" - in English, the first two constructs are the most frequent). Terminological entries are then filtered from the candidate list using statistical and machine learning methods. Once filtered, because of their low ambiguity and high specificity, these terms are particularly useful for conceptualizing a knowledge domain or for supporting the creation of a domain ontology. Furthermore, terminology extraction is a very useful starting point for semantic similarity, knowledge management, human translation and machine translation, etc.
- ↑ Menczer F., Pant G. and Srinivasan P. Topic-Driven Crawlers: machine learning issues.
- ↑ Fan J. and Kambhampati S. A Snapshot of Public Web Services, in ACM SIGMOD Record, Volume 34, Issue 1 (March 2005).
- ↑ Yan Zheng Wei, Luc Moreau, Nicholas R. Jennings. A market-based approach to recommender systems, in ACM Transactions on Information Systems (TOIS), 23(3), 2005.
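The two-stage approach described in the passage above (syntactic patterns propose candidate noun phrases, then a statistical filter keeps some of them) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the input is taken to be already part-of-speech tagged with a coarse one-letter tag set (A = adjective, N = noun, P = preposition, O = other), the names `TERM_PATTERN`, `candidate_terms`, `extract_terms`, and `TAGGED_CORPUS` are hypothetical, and a raw frequency threshold stands in for the statistical and machine learning filters mentioned in the quote.

```python
import re
from collections import Counter

# Assumed coarse tag set: A=adjective, N=noun, P=preposition, O=other.
# The pattern accepts compounds ("credit card"), adjective-NPs
# ("local tourist information office") and prepositional-NPs
# ("board of directors").
TERM_PATTERN = re.compile(r"[AN]*N(?:P[AN]*N)?")

def candidate_terms(tagged_sentence):
    """Yield multi-word candidates whose tag sequence matches TERM_PATTERN."""
    tags = "".join(tag for _, tag in tagged_sentence)
    words = [word.lower() for word, _ in tagged_sentence]
    for match in TERM_PATTERN.finditer(tags):
        # One tag character per token, so match offsets are token offsets.
        if match.end() - match.start() >= 2:
            yield " ".join(words[match.start():match.end()])

def extract_terms(tagged_corpus, min_count=2):
    """Keep candidates seen at least min_count times (a stand-in filter)."""
    counts = Counter(
        term for sentence in tagged_corpus for term in candidate_terms(sentence)
    )
    return [(term, n) for term, n in counts.most_common() if n >= min_count]

# Hand-tagged toy corpus (hypothetical data, tagged by hand for the example).
TAGGED_CORPUS = [
    [("The", "O"), ("credit", "N"), ("card", "N"), ("was", "O"), ("declined", "O")],
    [("Her", "O"), ("credit", "N"), ("card", "N"), ("expired", "O")],
    [("The", "O"), ("board", "N"), ("of", "P"), ("directors", "N"), ("met", "O")],
]

if __name__ == "__main__":
    print(extract_terms(TAGGED_CORPUS, min_count=2))
```

In practice the tags would come from a part-of-speech tagger, and the frequency cut-off would typically be replaced by a termhood or unithood score; the sketch only shows how the candidate and filtering stages fit together.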
2007
- (Morin et al., 2007) ⇒ Emmanuel Morin, Béatrice Daille, Koichi Takeuchi, and Kyo Kageura. (2007). “Bilingual Terminology Mining - Using Brain, Not Brawn Comparable Corpora.” In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007).
- QUOTE: Current research in text mining favors the quantity of texts over their representativeness. But for bilingual terminology mining, and for many language pairs, large comparable corpora are not available. More importantly, as terms are defined vis-à-vis a specific domain with a restricted register, it is expected that the representativeness rather than the quantity of the corpus matters more in terminology mining.
2003a
- (Mitkov, 2003) ⇒ Ruslan Mitkov, editor. (2003). “The Oxford Handbook of Computational Linguistics.” Oxford University Press. ISBN:019927634X
- QUOTE: term acquisition: Automatic discovery of previously unrecognized terms in corpora. ...
term spotting: Extracting of occurrences of terms or variants from text documents.
2003b
- (Navigli et al., 2003) ⇒ Roberto Navigli, Paola Velardi, and Aldo Gangemi. (2003). “Ontology Learning and Its Application to Automated Terminology Translation.” In: IEEE Intelligent Systems, 18(1). doi:10.1109/MIS.2003.1179190
2002
- (Daille, 2002) ⇒ Béatrice Daille. (2002). “Terminology Mining.” In: Proceedings of the Summer Convention on Information Extraction (SCIE 2002). doi:10.1007/b11781
- http://perso.limsi.fr/jacquemi/FASTR/
- Fastr can be used in two modes:
- controlled indexing: input consists of a corpus and a list of terms,
- free indexing: input only consists of a corpus, the list of terms is automatically acquired from the corpus.
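The difference between the two modes can be made concrete with a small sketch. This is not Fastr's implementation or interface: it matches terms verbatim only (no term variants), it uses repeated word bigrams as a crude stand-in for term acquisition, and every name in it (`tokenize`, `controlled_indexing`, `free_indexing`, `DOCS`) is hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase alphabetic tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z]+", text.lower())

def controlled_indexing(documents, terms):
    """Input: a corpus and a term list. Map each term to the ids of the
    documents that contain it verbatim."""
    index = {}
    for term in terms:
        needle = tokenize(term)
        if not needle:
            continue  # skip terms with no alphabetic tokens
        n = len(needle)
        for doc_id, text in enumerate(documents):
            tokens = tokenize(text)
            if any(tokens[i:i + n] == needle for i in range(len(tokens) - n + 1)):
                index.setdefault(term, []).append(doc_id)
    return index

def free_indexing(documents, min_count=2):
    """Input: a corpus only. Acquire terms from it (here: word bigrams seen
    at least min_count times, an assumed heuristic), then index with them."""
    counts = Counter()
    for text in documents:
        tokens = tokenize(text)
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    terms = [term for term, n in counts.items() if n >= min_count]
    return controlled_indexing(documents, terms)

# Toy corpus (hypothetical data).
DOCS = [
    "Term extraction builds a glossary.",
    "Manual term extraction is slow.",
    "The glossary lists terms.",
]

if __name__ == "__main__":
    print(controlled_indexing(DOCS, ["glossary", "term extraction"]))
    print(free_indexing(DOCS))
```

The design point is that free indexing reduces to controlled indexing once a term list has been acquired from the corpus itself.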
2001
- (Jacquemin, 2001) ⇒ Christian Jacquemin. (2001). “Spotting and Discovering Terms Through Natural Language Processing.” MIT Press. ISBN:0262100851
- (Bourigault et al., 2001) ⇒ Didier Bourigault, Christian Jacquemin, and Marie-Claude L'Homme, editors. (2001). “Recent Advances in Computational Terminology.” Volume 2 of Natural Language Processing. John Benjamins Publishing Company. ISBN:1588110168
1997
- (Jacquemin et al., 1997) ⇒ Christian Jacquemin, Judith Klavans, and Evelyne Tzoukermann. (1997). “Expansion of Multi-Word Terms for Indexing and Retrieval Using Morphology and Syntax.” In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL 1997). doi:10.3115/976909.979621.
1992
- (Bourigault, 1992) ⇒ Didier Bourigault. (1992). “Surface Grammatical Analysis for the Extraction of Terminological Noun Phrases.” In: Proceedings of the Fifteenth International Conference on Computational Linguistics.