2004 ApplyingConditionalRandomFieldsToJapMorphAn

Subject Headings: Conditional Random Field Model, Japanese Language, Morphological Analysis Task, CRFpp System, CRF Length Bias, Hierarchical Tag Set.

Notes

This paper presents Japanese morphological analysis based on conditional random fields (CRFs). Previous work in CRFs assumed that observation sequence (word) boundaries were fixed. However, word boundaries are not clear in Japanese, and hence a straightforward application of CRFs is not possible. We show how CRFs can be applied to situations where word boundary ambiguity exists. CRFs offer a solution to the long-standing problems in corpus-based or statistical Japanese morphological analysis. First, flexible feature designs for hierarchical tagsets become possible. Second, influences of label and length bias are minimized. We experiment with CRFs on the standard testbed corpus used for Japanese morphological analysis, and evaluate our results using the same experimental dataset as the HMMs and MEMMs previously reported in this task. Our results confirm that CRFs not only solve the long-standing problems but also improve the performance over HMMs and MEMMs.

Word boundary ambiguity cannot be ignored when dealing with non-segmented languages. A simple approach would be to let a character be a token (i.e., character-based Begin/Inside tagging) so that boundary ambiguity never occurs (Peng et al., 2004). However, B/I tagging is not a standard method in the 20-year history of corpus-based Japanese morphological analysis. This is because B/I tagging cannot directly reflect lexicons, which contain prior knowledge about word segmentation. We cannot ignore a lexicon, since over 90% accuracy can be achieved even using the longest prefix matching with the lexicon. Moreover, B/I tagging produces a number of redundant candidates, which makes decoding slower.
Traditionally in Japanese morphological analysis, we assume that a lexicon, which lists a pair of a word and its corresponding part-of-speech, is available.
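
The two segmentation strategies contrasted above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the toy lexicon, its part-of-speech labels, and the function names `longest_prefix_segment` and `to_bi_tags` are assumptions made for illustration only. The first function performs greedy longest prefix matching against a lexicon; the second converts a segmentation into character-based Begin/Inside tags.

```python
# Toy lexicon mapping a surface form to a part-of-speech label.
# Entries and labels are hypothetical, chosen only for this illustration.
LEXICON = {
    "東京": "noun",
    "京都": "noun",
    "東": "noun",
    "都": "suffix",
    "に": "particle",
    "住む": "verb",
}


def longest_prefix_segment(text, lexicon):
    """Greedy segmentation: at each position, take the longest lexicon entry.

    Characters that start no lexicon entry are emitted as single-character
    tokens labelled "UNK" (an assumption of this sketch).
    """
    max_len = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                match = text[i:j]
                break
        if match is None:
            match = text[i]
        tokens.append((match, lexicon.get(match, "UNK")))
        i += len(match)
    return tokens


def to_bi_tags(tokens):
    """Convert (word, pos) tokens to one B/I tag per character."""
    tags = []
    for word, _pos in tokens:
        tags.append("B")
        tags.extend("I" for _ in word[1:])
    return tags


if __name__ == "__main__":
    segmented = longest_prefix_segment("東京都に住む", LEXICON)
    print(segmented)
    print(to_bi_tags(segmented))
```

Note that the greedy matcher commits to 東京 / 都 and never considers the competing split 東 / 京都, while the B/I view discards the lexicon's part-of-speech information altogether. This is the kind of ambiguity that a model scoring whole segmentation paths is meant to resolve.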

Japanese part-of-speech (POS) tagsets used in the two major Japanese morphological analyzers, ChaSen (http://chasen.naist.jp/) and JUMAN (http://www.kc.t.u-tokyo.ac.jp/nl-resource/juman.html), take the form of a hierarchical structure. For example, the IPA tagset (http://chasen.naist.jp/stable/ipadic/) used in ChaSen consists of three categories: part-of-speech, conjugation form (cform), and conjugate type (ctype).
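
One way to picture what a hierarchical tagset allows is to emit one feature per level of the part-of-speech hierarchy, plus separate features for cform and ctype, so that a model can weight coarse and fine distinctions independently. The sketch below is only an illustration under that assumption: the function name `hierarchical_features`, the feature string format, and the example tag values are hypothetical and are not taken from the paper or from the IPA dictionary.

```python
def hierarchical_features(pos_path, cform=None, ctype=None):
    """Build one feature string per prefix of a hierarchical POS tag.

    pos_path is a list of subcategories ordered from coarse to fine,
    e.g. ["noun", "proper", "place"]. cform and ctype are optional and
    only apply to conjugating words.
    """
    features = []
    for depth in range(1, len(pos_path) + 1):
        # Each prefix of the path becomes its own feature, so the coarse
        # level ("noun") is shared by every noun subcategory.
        features.append("POS%d=%s" % (depth, "-".join(pos_path[:depth])))
    if cform is not None:
        features.append("CFORM=" + cform)
    if ctype is not None:
        features.append("CTYPE=" + ctype)
    return features


if __name__ == "__main__":
    print(hierarchical_features(["noun", "proper", "place"]))
    print(hierarchical_features(["verb", "independent"], cform="base", ctype="godan"))
```

Because every level yields its own feature, statistics for a rare fine-grained tag can back off to its parent category without collapsing the tagset to a single level.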


	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2004 ApplyingConditionalRandomFieldsToJapMorphAn	Taku Kudo Kaoru Yamamoto Yuji Matsumoto			Applying Conditional Random Fields to Japanese Morphological Analysis			http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Kudo.pdf