2004 ApplyingConditionalRandomFieldsToJapMorphAn
- (Kudo et al., 2004) ⇒ Taku Kudo, Kaoru Yamamoto, Yuji Matsumoto. (2004). “Applying Conditional Random Fields to Japanese Morphological Analysis.” In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2004).
Subject Headings: Conditional Random Field Model, Japanese Language, Morphological Analysis Task, CRFpp System, CRF Length Bias, Hierarchical Tag Set.
Notes
Cited By
Quotes
Abstract
- This paper presents Japanese morphological analysis based on conditional random fields (CRFs). Previous work in CRFs assumed that observation sequence (word) boundaries were fixed. However, word boundaries are not clear in Japanese, and hence a straightforward application of CRFs is not possible. We show how CRFs can be applied to situations where word boundary ambiguity exists. CRFs offer a solution to the long-standing problems in corpus-based or statistical Japanese morphological analysis. First, flexible feature designs for hierarchical tagsets become possible. Second, influences of label and length bias are minimized. We experiment CRFs on the standard testbed corpus used for Japanese morphological analysis, and evaluate our results using the same experimental dataset as the HMMs and MEMMs previously reported in this task. Our results confirm that CRFs not only solve the long-standing problems but also improve the performance over HMMs and MEMMs.
2 Japanese Morphological Analysis
2.1 Word Boundary Ambiguity
- Word boundary ambiguity cannot be ignored when dealing with non-segmented languages. A simple approach would be to let a character be a token (i.e., character-based Begin/Inside tagging) so that boundary ambiguity never occurs (Peng et al., 2004). However, B/I tagging is not a standard method in the 20-year history of corpus-based Japanese morphological analysis. This is because B/I tagging cannot directly reflect lexicons which contain prior knowledge about word segmentation. We cannot ignore a lexicon since over 90% accuracy can be achieved even using the longest prefix matching with the lexicon. Moreover, B/I tagging produces a number of redundant candidates which makes the decoding speed slower.
- Traditionally in Japanese morphological analysis, we assume that a lexicon, which lists a pair of a word and its corresponding part-of-speech, is available.
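The contrast drawn above between lexicon-based analysis and character-based B/I tagging can be illustrated with a minimal sketch. It assumes a toy lexicon; the entries, POS labels, and function names (`LEXICON`, `longest_prefix_segment`, `all_segmentations`, `to_bi_tags`) are hypothetical and are not taken from the paper or from any analyzer.

```python
# Toy lexicon: surface form -> list of POS labels (hypothetical entries).
LEXICON = {
    "東": ["noun"],
    "東京": ["noun"],
    "京都": ["noun"],
    "都": ["suffix"],
    "に": ["particle"],
    "住む": ["verb"],
}
MAX_LEN = max(len(w) for w in LEXICON)


def longest_prefix_segment(text):
    """Greedy baseline: at each position take the longest lexicon word."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            if text[i:i + n] in LEXICON:
                break
        else:
            n = 1  # unknown character: emit it as a one-character word
        words.append(text[i:i + n])
        i += n
    return words


def all_segmentations(text):
    """Enumerate every segmentation whose words are all in the lexicon."""
    if not text:
        return [[]]
    results = []
    for n in range(1, min(MAX_LEN, len(text)) + 1):
        if text[:n] in LEXICON:
            results += [[text[:n]] + rest for rest in all_segmentations(text[n:])]
    return results


def to_bi_tags(words):
    """Character-based Begin/Inside tags for one segmentation."""
    return [(ch, "B" if k == 0 else "I") for w in words for k, ch in enumerate(w)]


print(longest_prefix_segment("東京都に住む"))
print(all_segmentations("東京都"))
print(to_bi_tags(["東京", "都"]))
```

In this toy setting the lexicon admits two segmentations of the same three characters, which is the word boundary ambiguity the section describes. The greedy baseline commits to one of them, whereas B/I tagging treats every character as a token and does not consult the lexicon at all.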
2.2.1 Hierarchical Tagset
- Japanese part-of-speech (POS) tagsets used in the two major Japanese morphological analyzers ChaSen (http://chasen.naist.jp/) and JUMAN (http://www.kc.t.u-tokyo.ac.jp/nl-resource/juman.html) take the form of a hierarchical structure. For example, IPA tagset (http://chasen.naist.jp/stable/ipadic/) used in ChaSen consists of three categories: part-of-speech, conjugation form (cform), and conjugate type (ctype).
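One way to picture a tag with these three categories, and the kind of flexible feature design the abstract refers to, is the sketch below. It is an illustration only: the `Tag` class, the `tag_templates` function, the layout of `pos` as a coarse-to-fine tuple, the `"*"` placeholder, and the example tag values are all assumptions, not the paper's feature set or the IPA tagset's actual labels.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Tag:
    pos: Tuple[str, ...]  # POS levels, coarse to fine (assumed layout)
    cform: str = "*"      # conjugation form; "*" when not applicable
    ctype: str = "*"      # conjugate type; "*" when not applicable


def tag_templates(tag):
    """Return feature strings from the coarsest to the most specific view."""
    # One feature per prefix of the POS hierarchy.
    feats = ["/".join(tag.pos[:d]) for d in range(1, len(tag.pos) + 1)]
    full = "/".join(tag.pos)
    # Conjugation details are added only for tags that carry them.
    if tag.ctype != "*":
        feats.append(f"{full}|ctype={tag.ctype}")
    if tag.cform != "*":
        feats.append(f"{full}|ctype={tag.ctype}|cform={tag.cform}")
    return feats


verb = Tag(("verb", "independent"), cform="basic", ctype="godan-ma")
for feature in tag_templates(verb):
    print(feature)
```

Emitting one feature per level lets a model share statistics across all tags under a coarse category while still distinguishing fine-grained tags where the data supports it.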
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2004 ApplyingConditionalRandomFieldsToJapMorphAn | Taku Kudo, Kaoru Yamamoto, Yuji Matsumoto | | | Applying Conditional Random Fields to Japanese Morphological Analysis | | | http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Kudo.pdf | | | |