Text-Token Character-Pattern Predictor Feature
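A Text-Token Character-Pattern Predictor Feature is a text-token feature that is based on replacing each character of a token with a single symbol that represents its character type (such as uppercase letter, lowercase letter, digit, or punctuation).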
- Example(s):
- GetPattern("G.M.") ⇒ "A-A-"
- GetPattern("Machine-223") ⇒ "Aaaaaaa-000"
- GetCompressedPattern("Machine-223") ⇒ "Aa-0"
- Counter-Example(s):
- a Text Token Length Feature.
- a Text Token hasCapitalLetter Feature, such as [math]\displaystyle{ f }[/math](hasCapital("Markov")) ⇒ 1.
- a Text Token Dictionary Match Feature, such as [math]\displaystyle{ f }[/math](equals("Markov","Jordan")) ⇒ 0.
- a Character n-Gram Feature, such as [math]\displaystyle{ f }[/math]("rko", "Markov") ⇒ true.
- a Text Token Part-of-Speech Role Feature.
- See: Text Token.
References
2007
- (Nadeau & Sekine, 2007) ⇒ David Nadeau, and Satoshi Sekine. (2007). “A Survey of Named Entity Recognition and Classification.” In: Lingvisticae Investigationes, 30(1).
- QUOTE: Pattern features were introduced by M. Collins (2002) and then used by others (W. Cohen & Sarawagi 2004 and B. Settles 2004). Their role is to map words onto a small set of patterns over character types. For instance, a pattern feature might map all uppercase letters to “A”, all lowercase letters to “a”, all digits to “0” and all punctuation to “-”:
x = "G.M.": GetPattern(x) = "A-A-"
x = "Machine-223": GetPattern(x) = "Aaaaaaa-000"
The summarized pattern feature is a condensed form of the above in which consecutive character types are not repeated in the mapped string. For instance, the preceding examples become:
x = "G.M.": GetSummarizedPattern(x) = "A-A-"
x = "Machine-223": GetSummarizedPattern(x) = "Aa-0"
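The mapping described in the quote can be sketched in Python roughly as follows. This is a minimal illustration, not the implementation used by the cited authors: the function names `char_type`, `get_pattern`, and `get_summarized_pattern` are hypothetical, and the choice to map every character that is not an uppercase letter, lowercase letter, or digit to "-" is an assumption.

```python
from itertools import groupby


def char_type(ch: str) -> str:
    """Map a single character to its character-type symbol."""
    if ch.isupper():
        return "A"
    if ch.islower():
        return "a"
    if ch.isdigit():
        return "0"
    # Assumption: anything else (punctuation, whitespace, caseless
    # letters) is treated as punctuation.
    return "-"


def get_pattern(token: str) -> str:
    """Replace every character of the token with its type symbol."""
    return "".join(char_type(ch) for ch in token)


def get_summarized_pattern(token: str) -> str:
    """Collapse each run of identical type symbols to one symbol."""
    return "".join(symbol for symbol, _run in groupby(get_pattern(token)))


for token in ("G.M.", "Machine-223"):
    print(token, get_pattern(token), get_summarized_pattern(token))
```

Under these assumptions the sketch reproduces the quoted examples: "G.M." gives "A-A-" for both forms, while "Machine-223" gives "Aaaaaaa-000" and the summarized form "Aa-0".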
2002
- (Collins, 2002) ⇒ Michael Collins. (2002). “Ranking Algorithms for Named-entity Extraction: Boosting and the Voted Perceptron.” In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. doi:10.3115/1073083.1073165