2001 SpottingAndDiscovTerms
- (Jacquemin, 2001) ⇒ Christian Jacquemin. (2001). “Spotting and Discovering Terms Through Natural Language Processing.” In: MIT Press. ISBN:0262100851
Subject Headings: FASTR System, Term Variation, Coordination Term Variation, Permutation Term Variation, Modification-Substitution Term Variation, Elision Term Variation, AGROVOC, Regular Expression.
Notes
- Book review:
- Term variation is explicitly described for English through four types of elementary term variation: coordination, permutation, modification/substitution, and elision.
Cited By
Quotes
Book Overview
In this book Christian Jacquemin shows how the power of natural language processing (NLP) can be used to advance text indexing and information retrieval (IR). Jacquemin's novel tool is FASTR, a parser that normalizes terms and recognizes term variants. Since there are more meanings in a language than there are words, FASTR uses a metagrammar composed of shallow linguistic transformations that describe the morphological, syntactic, semantic, and pragmatic variations of words and terms. The acquired parsed terms can then be applied for precise retrieval and assembly of information.
The use of a corpus-based unification grammar to define, recognize, and combine term variants from their base forms allows for intelligent information access to, or "linguistic data tuning" of, heterogeneous texts. FASTR can be used to do automatic controlled indexing, to carry out content-based Web searches through conceptually related alternative query formulations, to abstract scientific and technical extracts, and even to translate and collect terms from multilingual material. Jacquemin provides a comprehensive account of the method and implementation of this innovative retrieval technique for text processing.
Acknowledgements
1 Introduction
1.1 Motivation
1.2 Term Spotting through Term Normalization
1.3 Is In-depth Understanding a Viable Alternative?
1.4 Term Variation: A Central Issue
1.5 Overview of the Study
2 Studies in Term Extraction
(...)
- Depending on whether we work with single-word terms or multi-word terms, the central issues in the design of a term extraction system are very different:
  - Single-word terms are generally polysemous and call for word-sense disambiguation and context analysis.
  - Multi-word terms are far less polysemous than single-word terms, but since they have a phrase structure, they are prone to variations. Their identification calls for morphosyntactic analyzers or n-ary statistical measures.
- The following types of pairs are considered: (1) a head noun and its left adjective or noun adjunct, (2) a head noun and the head of its right adjunct, (3) the main verb of a clause and the head of its object phrase, and (4) the head of the subject phrase and the main verb. These types of pairs account for most of the syntactic variants for relating two words (or simple phrases) into pairs carrying compatible semantic content. Page 110
2.1 Basic Concepts and Techniques
2.2 Term Acquisition
2.3 Parsers for Phrase Indexing
2.4 FASTR, Exploiting Term Variation in Term Spotting
3 Terms
3.1 FASTR Formalism
3.2 Morphology
3.3 Extended Domain of Locality and Lexicalization
3.4 Derivation within FASTR
3.5 Parsing with FASTR
Analysis in FASTR is organized according to the traditional stratified scheme of NLP applications: a morphological step followed by a syntactic step (Alshawi 1992). In addition two new components are necessary for dealing with term extraction: (1) a term preprocessor that allows for the recycling of terms into the grammar and (2) a term variant generator that enriches the term grammar by transforming term rules into term variant rules through the application of metarules. The data flow in FASTR is described by figure 3.7.
3.5.1 Term Recycling
In the experiments reported in this book, two main sources of terms have been used: terms lists exploited in manual indexing (Jacquemin and Royauté 1994) and terms acquired automatically. In the second case, two main possibilities exist: either terms are obtained through a massive initial acquisition (Jacquemin, Klavans, and Tzoukermann 1997), or terms are incrementally acquired through term enrichment from variation (chapter 6).
First, terms in the initial list are morphologically analyzed, and each word belonging to a term receives a set of possibly ambiguous morphological features. Then, terms are syntactically disambiguated according to a generic noun phrase grammar, yielding unambiguously analyzed terms. For example, through this process, out of the two possible taggings of the term Arterial pressure, Adjective-Verb and Adjective-Noun, the second one is chosen and represented as A1, N1H. This structure indicates that the term has an Adjective-Noun structure, that the head word of the structure is the second word, and that the inflectional paradigms are respectively 1 for the adjective and 1 for the head noun.
Finally, these term structures, together with the lemmas of the words, are used to generate the final single word and term rules that will be compiled in the application (see rules 3.1 and 3.2 for examples of such rules).
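The disambiguation step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not FASTR's implementation: the toy lexicon `LEXICON`, the single-pattern noun phrase grammar `NP_PATTERN`, and the function `disambiguate` are all hypothetical names, the head is assumed to be the last word, and inflectional paradigm numbers are left out.

```python
import re
from itertools import product

# Hypothetical toy lexicon: each word maps to its possible categories
# (A = adjective, N = noun, V = verb). Not taken from FASTR.
LEXICON = {
    "arterial": ["A"],
    "pressure": ["V", "N"],
}

# Assumed generic noun phrase grammar: any adjectives or nouns, then a head noun.
NP_PATTERN = re.compile(r"[AN]*N")


def disambiguate(term):
    """Return (word, category, is_head) triples for the first tagging of
    `term` accepted by the noun phrase pattern, or None if none is."""
    words = term.lower().split()
    # Enumerate every combination of the ambiguous categories.
    for tags in product(*(LEXICON[w] for w in words)):
        if NP_PATTERN.fullmatch("".join(tags)):
            last = len(words) - 1  # assumption: the head is the last word
            return [(w, t, i == last)
                    for i, (w, t) in enumerate(zip(words, tags))]
    return None


print(disambiguate("Arterial pressure"))
```

For Arterial pressure, the Adjective-Verb tagging is rejected by the pattern and the Adjective-Noun tagging is kept, with pressure marked as the head.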
3.5.2 Morphological Analysis
The second step is morphological analysis of text corpora, which is subdivided into the following procedures (figure 3.8):
(...)
3.6 Summary
4 Variations
4.1 Linguistic Analysis of Term Variations
4.1.1 Transformations
4.1.2 Sublanguages
4.1.3 Knowledge Maintenance
FASTR is designed to recognize occurrences of terms and variants. The requirements for the term recognition procedure are simpler than for a generation procedure. For instance, an analyzer of term variants must reject blood and cell as a variant of Blood cell but should accept blood and bone marrow cell. In comparison, a generation procedure must possess a much finer knowledge of the semantic restrictions involved in the construction of variants. For instance, dorsal and cervical spine is a correct coordination built from Dorsal spine. This term can also be coordinated with other A spine (A=adjective) structures such as Lumbar spine or Lumbosacral spine. But all the A spine terms in which spine does not refer to a nervous spine such as Fish spine or Hedgehog spine cannot be coordinated with Dorsal spine.
Thus, in generation, the constraint of semantic isotopy between coordinated terms with a common head must be considered, whereas it can be ignored in analysis: it is not likely for an occurrence such as hedgehog or dorsal spine to be encountered in a real-world corpus. The relative easiness of term analysis compared with term generation justifies the use of a pure syntactic method for the filtering of correct term variants. The author's first technique for describing term variations was based on acceptability (Jacquemin 1991). However, since for human experts determining acceptability is time-consuming and still does not meet the strict requirements of redefinition techniques, acceptability definition has been replaced by experimental tuning.
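A purely syntactic analysis of coordination variants, as contrasted with generation above, can be sketched as follows. This is a hedged illustration and not FASTR's metarule machinery: the tag table `TAGS` and the function `is_coordination_variant` are hypothetical, and the check only handles two-word controlled terms whose first word is the coordinated argument.

```python
# Hypothetical toy lexicon (A = adjective, N = noun, C = conjunction).
TAGS = {
    "blood": "N", "bone": "N", "marrow": "N", "cell": "N",
    "dorsal": "A", "cervical": "A", "hedgehog": "N", "spine": "N",
    "and": "C", "or": "C",
}


def is_coordination_variant(term, candidate):
    """Syntactic check only: `candidate` must have the shape
    <argument> C <one or more A/N words> <head> for a two-word `term`."""
    argument, head = term.lower().split()
    words = candidate.lower().split()
    # At least one word must be coordinated with the argument.
    if len(words) < 4 or words[0] != argument or words[-1] != head:
        return False
    if TAGS.get(words[1]) != "C":
        return False
    # The coordinated material may contain adjectives and nouns only.
    return all(TAGS.get(w) in ("A", "N") for w in words[2:-1])


print(is_coordination_variant("Blood cell", "blood and cell"))
print(is_coordination_variant("Blood cell", "blood and bone marrow cell"))
print(is_coordination_variant("Dorsal spine", "dorsal or hedgehog spine"))
```

The first call is rejected and the second accepted, matching the examples in the text. The third call is also accepted, because the sketch models no semantic constraint at all; this mirrors the point that analysis can afford to ignore such constraints while generation cannot.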
4.2 Description of Variations through Metarules
4.3 A Constructive View of Metarules
4.4 Summary
5 Experimental Tuning
5.1 Elementary Variations of Binary Terms
5.2 Elementary Variations of n-ary Terms
5.3 Compositions of Elementary Variations
5.4 Refining Metarules
5.5 Evaluating Syntactic Metarules
5.6 Summary
6 Term Enrichment
6.1 Automatic Thesaurus Acquisition
6.2 Statistical Acquisition from Variations
6.3 Term Enrichment from Variations
6.4 Summary
7 Morphosyntactic Variants
7.1 Morphological Links and Regular Expressions in FASTR
7.2 Evaluation of Morphosyntactic Variant Extraction
7.3 Evaluating Morphosyntactic Metarules
7.4 Summary
8 Semantic Variation and Applications
8.1 Semantic Variation
8.2 Applications
8.3 Other Interesting Directions
9 Conclusion
A Metarule Files
A.1 Paradigmatic Syntactic Metarules
A.2 Filtering Syntactic Metarules
A.3 Paradigmatic Morphosyntactic Metarules
A.4 Filtering Morphosyntactic Metarules
A.5 Filtering Semantic and Morphosemantic Metarules
A.6 Pattern Extractors
B Structured Acquisitions
C Corpus and Term Lists
D Grammar Files
Glossary
- Acquisition graph: An acquisition graph is built by connecting each candidate term to all the terms from which it has been acquired. For instance, if the variant frontal or sphenoid sinus of frontal sinus yields the candidate sphenoid sinus, the acquisition graph has an arc from frontal sinus to sphenoid sinus labeled "coordination."
- Automatic indexing: Automatic indexing is the association of descriptors to documents for the purpose of information retrieval.
- Binary decomposition: A term with more than two content words can generally be decomposed into binary substructures (substructures with two content words). For instance, the binary decomposition of left coronary artery is [left [coronary artery]].
- Candidate term: A candidate term is a term produced by automatic acquisition that has not been manually validated.
- Co-occurrence (insertion or permutation ~): A co-occurrence associated with a controlled term [math]\displaystyle{ t }[/math] is a text window that contains its content words w1 and w2. If the order of the words in the co-occurrence is the same as in the controlled term, the co-occurrence is an insertion co-occurrence; otherwise, the co-occurrence is a permutation co-occurrence. Variations are a special case of linguistically motivated co-occurrence.
- Collocation: A collocation denotes a recurring association of two words, and the linguistic links between these two words as well. Collocations are divided into lexical collocations such as doctor and hospital and grammatical collocations such as make and decision.
- Complex (or multi-word) and single-word terms: A complex or multi-word term is a term that contains at least two content words; a term which contains one content word is a single-word term. The term artery is a single-word term, while coronary artery is a complex term.
- Composition of variations (homogeneous and heterogeneous): A composition of variations is the result of the repeated transformations of a controlled term by these variations. The composition is homogeneous if and only if all the variations belong to the same family (coordination, modifications/substitutions, permutations, morphological variations, or semantic variations). The example given for the definition of a transformational path is a heterogeneous composition because it involves a modification/substitution and a coordination.
- Context-free skeleton: The syntactic structure of a rule or a metarule is its context-free skeleton.
- Controlled term: A controlled term is a term that has been manually validated and inserted into a term bank or a thesaurus. Automatic indexing based on controlled terms is called controlled indexing; otherwise, it is called free indexing.
- Coordination variation: A coordination variation is the association of two terms with a common substructure which is factorized in the variant. For example, blood and bone marrow cell is a coordination variant of Blood cell resulting from its coordination with Bone marrow cell. If the common substructure contains the head words, the variation is called an argument coordination. Otherwise, the variation is a head coordination. The preceding example is an argument coordination.
- Deconstruction of variation: Coordination and modification/substitution variations involve two terms, but their detection only relies on one controlled term. The deconstruction of a variation of a term [math]\displaystyle{ t }[/math] which involves another term t′ is the extraction of the "hidden" term t′. For instance, the coordination variant abdominal and thoracic artery, computed from the controlled term Abdominal artery, yields through deconstruction the candidate term thoracic artery.
- Dependence: In the binary decomposition of a term, dependences are the relationships between heads and arguments. For instance, in the term [left [coronary artery]], coronary depends on artery (in the substructure coronary artery) and left depends on artery (in the upper structure left [coronary artery]).
- Elementary variations: The four basic types of syntactic variations (coordination, elision, modification/substitution, and permutation) are called elementary variations. They can be composed into complex variations (see composition of variations).
- Elision variation: An elision variation is a deletion of one or more content words but not the head word. For instance, sunflower oil is an elision variant of Sunflower seed oil resulting from the deletion of seed.
- Endocentricity or exocentricity: A complex lexical entry is endocentric if the complex lexical entry and the head words share the same features (syntactic and semantic features). For instance, bone marrow cell is an endocentric structure (in particular, because a bone marrow cell is a kind-of cell). By contrast, a term such as nose bleed is exocentric.
- Extended co-occurrence: An extended co-occurrence associated with a controlled term [math]\displaystyle{ t }[/math] is a text window that contains its content words w1 and w2, or one of the morphologically related words, or one of the semantically related words.
- Extended domain of locality: Since tree grammars are not restricted to immediate dependency, complex lexical entries can be described through a lexicalized grammar in which each grammar rule describes the structure of a lexical entry.
- False negative: A false negative is a correct variant wrongly accepted by a negative filtering metarule.
- False positive: A false positive is an incorrect variant wrongly accepted by a positive filtering metarule.
- False drop: A false drop is a correct variant wrongly rejected by a positive filtering metarule.
- Feature structure: A feature structure is a directed graph in which arcs are labeled by features and in which nodes without outgoing arcs are possibly labeled by values.
- Filtering: Filtering is the selection of a subset of data based on statistical or symbolic criteria. For instance, the pattern A N is a symbolic filter that discards blood cell and accepts mononuclear cell.
- Finite closure: Given a set of transformations (metarules) and a domain (terms), the finite closure of this system is the set of all the images of all the elements in the domain by all the possible finite compositions of transformations.
- Free or controlled indexing: Controlled indexing is a procedure for automatic indexing in which the descriptors are controlled terms. Otherwise, it is called free indexing.
- Frontier of a tree: The frontier of a tree is the sequence of words or syntactic labels obtained by a prefixed depth-first traversal of the tree.
- Harrisian transformation: In Harrisian linguistics, transformations are used to generate the variety of observed linguistic forms from a core set of minimal phrases. These phrases represent the information content of the text in which their transformations are encountered.
- Heterocategorial variation: Heterocategorial variations are morphological variations in which the categories of the morphologically related words differ. For instance, the variation that associates Muscle contraction with contracted muscle is heterocategorial.
- Hybrid variation: Hybrid variations are compositions of at least two variations in at least two of the three following categories: morphological variations, semantic variations, or syntactic variations. For instance, the variation that associates Nervous tissue with neural and nonneural human tissues is hybrid because it is the composition of a syntactic variation (a coordination) and a semantic variation involving the semantic link between nervous and neural.
- Isocategorial variation: Isocategorial variations are morphological variations in which the morphologically related words have the same category. For instance, the variation which associates Death certificate with death certification is isocategorial.
- Lexical anchor: The lexical anchor of a complex term is any subset of the set of its content words.
- Lexicalization: A tree grammar is lexicalized if and only if each rule is related with one and only one (single- or multi-word) lexical item.
- Metarule (filtering or paradigmatic ~): A metarule is composed of a context-free skeleton and constraints. The context-free skeleton is a double tree: one for the source and one for the target. The constraints are optional. Metarules without constraints are called paradigmatic metarules; metarules with constraints are called filtering metarules.
- Metarule (source and target of a ~): A metarule is composed of a pair of trees: a source and a target. The transformation of a rule by a metarule is composed of two steps: first the rule is paired with the source of the metarule; then the transformed rule is obtained by copying the target of the metarule. For instance, the source of the metarule Coor(N1 -> A2 N3) == N1 -> A2 C4 A5 N3 can be paired with the term Cerebral artery and produces a variant rule cerebral C A artery.
- Modification variation: A modification variation is the insertion of a modifier in the structure of a controlled term. For instance, automatic image analysis is a modification variant of automatic analysis resulting from the modification of analysis by image. In this case, image analysis is not a controlled term; otherwise, the variation would be a substitution variation.
- Morphological family: Two noncompound lemmas belong to the same morphological family if they have a morphological link with the same root word. For instance, illiteracy and literateness are in the same morphological family because they share the same root lemma literate.
- Morphological link: There is a morphological link from lemma [math]\displaystyle{ l }[/math] to lemma l′ if [math]\displaystyle{ l }[/math] is (one of) the root lemma(s) of l′. For instance, there is a morphological link from literate to illiteracy.
- Morphological variation: A morphological variation is the transformation of one of the content words into a word in the same morphological family. For example, abused intravenous drugs is a noun to verb variant of Drug abuse.
- n-gram language model: In the statistical analysis of a document, the model of language is an n-gram model if the elementary events are n-tuples of words. If [math]\displaystyle{ n }[/math] equals 2 or 3, the model is called a bigram or trigram model, respectively.
- Paradigmatic or filtered variants: Paradigmatic variants are extracted through paradigmatic metarules and filtered variants through filtering metarules that contain additional constraints.
- Pattern extractor: A variant deconstruction is performed by extracting a syntactic pattern from a variant. The pattern is closely related to the structure of the metarule that has produced the variant.
- Permutation variation: A permutation variation is the transformation of a structure of Germanic compounding such as Growth rate into a noun phrase with a post-posed prepositional modifier such as rate of growth.
- Postfiltering: The candidate terms produced through variant deconstruction are called pre-candidate terms. Postfiltering is the process through which pre-candidate terms are filtered in order to retain candidate terms.
- Precision: The precision of controlled indexing is the proportion of correct indexes among the indexes of the documents.
- Precision of fallout: The precision of fallout in controlled indexing is the proportion of correct indexes among the rejected occurrences.
- Prefixed variation: Prefixed variations are morphological variations in which the morphological relation involves prefixing. For instance, the variation that associates Chemical exchange with chemical change is prefixed.
- Ratio (adjacency or dependency ~): Structured disambiguation relies on the computation of correct substructures through statistical measure of co-occurrence. Adjacency and dependency ratios are two measures used for the computation.
- Recall: The recall in controlled indexing is the proportion of extracted indexes among the correct indexes.
- Relative frequency: The relative frequency of a term in a document is the ratio of its number of occurrences to the size of the document.
- Semantic class: Two lemmas belong to the same semantic class if they have a semantic link with the same lemma. For instance, refinement and intricacy are in the same semantic class because they both have a semantic link with elaboration.
- Semantic link: For instance, there is a semantic link from elaboration to refinement.
- Semantic variation: A semantic variation is the transformation of one of the content words into a word in the same semantic family. For example, recurrence rate is a semantic variant of Return rate.
- Subterm: In a term structure, any substructure that is a controlled term is a subterm. For instance, if Bone marrow is a controlled term, it is a subterm of [[bone marrow] cell].
- Statistical measure of co-occurrence: A statistical measure of co-occurrence is a statistical measure of the co-occurrences of two or more words in a fixed-length text window.
- Structural ambiguity: A term is structurally ambiguous if it accepts several substructures with respect to a grammar of terms. For instance, given a grammar of noun phrases (NP -> NP N, NP -> N NP, NP -> N, N -> bone, N -> marrow, N -> cell), bone marrow cell is ambiguous because it accepts bone marrow and marrow cell as substructures.
- Structural disambiguation: If a term is structurally ambiguous, the structural disambiguation of the term is the selection of the substructures that are linguistically plausible.
- Substitution variation: A substitution variation is the replacement of a substantive word [math]\displaystyle{ w }[/math] by a controlled term whose head word is w. For instance, congenital heart disease is a substitution variant of Congenital disease in which the head word disease is substituted by the term heart disease. If the substituted phrase is not a term, the variation is a modification variation.
- Syntactic pattern: A syntactic pattern is a tree structure in which leaf nodes are lexemes or syntactic categories. By extension, a syntactic pattern is also the frontier of such a tree structure.
- Term recycling: Terminological data are lexical databases that can be used to build electronic dictionaries for the purpose of natural language processing. Term recycling is the transformation and the exploitation of such a term base in natural language processing.
- Terminography: Terminography refers to the study, the collection, the classification, the definition, and the conceptual classification of terms.
- Text coverage: The text coverage of a linguistic phenomenon such as term variation is the proportion of a given text that is covered by this phenomenon.
- Text window: A text window is a sequence of [math]\displaystyle{ n }[/math] consecutive words in a document. For instance, in this sentence is a 3-word window in this sentence.
- Train or test corpus: The training corpus is used for tuning a set of metarules for a given language. The test corpus is used for the evaluation of a set of metarules for a given set of terms and a given language.
- Transformational path: A transformational path of a nonelementary variation is a sequence of elementary variations whose composition is equal to this variation. For instance, a transformational path corresponding to,
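The glossary definitions of text window and of insertion and permutation co-occurrences can be made concrete with a small sketch. It rests on simplifying assumptions that are not from the book: tokens are lowercase whitespace-separated words matched exactly, only the first occurrence of each word inside a window is considered, and the function name `cooccurrences` is hypothetical.

```python
def cooccurrences(tokens, w1, w2, n):
    """Yield (start, kind) for each n-word window of `tokens` that contains
    both content words; kind is "insertion" when w1 precedes w2 (the order
    of the controlled term) and "permutation" otherwise."""
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if w1 in window and w2 in window:
            same_order = window.index(w1) < window.index(w2)
            yield start, "insertion" if same_order else "permutation"


# Controlled term: Growth rate (w1 = "growth", w2 = "rate"), 3-word windows.
text = "the rate of growth exceeds the growth rate".split()
for start, kind in cooccurrences(text, "growth", "rate", 3):
    print(start, " ".join(text[start:start + 3]), kind)
```

On this sample, the window rate of growth is reported as a permutation co-occurrence and the window the growth rate as an insertion co-occurrence. Overlapping windows can report the same pair of words more than once, so a real system would need some deduplication step.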
 | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
---|---|---|---|---|---|---|---|---|---|---|
2001 SpottingAndDiscovTerms | Christian Jacquemin | | | Spotting and Discovering Terms Through Natural Language Processing | | MIT Press | http://books.google.com/books?id=W6AB06SBAGMC | | | 2001 |