Written Sentence Boundary Detection Task
A Written Sentence Boundary Detection Task is a text segmentation task that is a sentence boundary detection task (which requires the identification of the start and end of linguistic sentences in a text item).
- AKA: Written Sentence Segmentation Task, Sentence Boundary Detection, Text Segmentation, Sentence Segmentation.
- Context:
- It can be treated, in English Written Language, as a Classification Task by labeling the sense of an End-of-Sentence Punctuation as being either an Abbreviation Marker or an End-of-Sentence Marker.
- It can be solved by a Written Sentence Boundary Detection System (that implements a Written Sentence Boundary Detection Algorithm).
- Example(s):
- “I walked home with Ms. Smith. She ate breakfast.” ⇒
<SENT>I walked home with Ms. Smith.</SENT> <SENT>She ate breakfast.</SENT>
- “The virA and virG genes control the induction of vir genes by plant signals. virA encodes a membrane-bound sensor kinase protein and virG encodes a cytoplasmic regulator protein.” (a challenging example, because the first letter of the second sentence is not capitalized).
- “Plant signal molecules such as acetosyringone and certain monosaccharides induce the expression of Agrobacterium tumefaciens virulence (vir) genes, which are required for the processing, transfer, and possibly integration of a piece of the bacterial plasmid DNA (T-DNA) into the plant genome. Two of the vir genes, virA and virG, belonging to the bacterial two-component regulatory system family, control the induction of vir genes by plant signals. virA encodes a membrane-bound sensor kinase protein and virG encodes a cytoplasmic regulator protein.”
⇒<PSID=8611.0>Plant signal molecules such as acetosyringone and certain monosaccharides induce the expression of Agrobacterium tumefaciens virulence (vir) genes, which are required for the processing, transfer, and possibly integration of a piece of the bacterial plasmid DNA (T-DNA) into the plant genome. <PSID=8611.1>Two of the vir genes, virA and virG, belonging to the bacterial two-component regulatory system family, control the induction of vir genes by plant signals. <PSID=8611.2>virA encodes a membrane-bound sensor kinase protein and virG encodes a cytoplasmic regulator protein.
- Counter-Example(s):
- See: PPLRE Project, Full Stop, Sentences.
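The classification framing in the Context items above can be illustrated with a minimal sketch. It assumes whitespace tokenization and a tiny, hypothetical abbreviation table (`ABBREVIATIONS`); the function names `classify_period`, `split_sentences`, and `tag_sentences` are illustrative only and do not come from any of the referenced systems. Each period-final token is labeled as either an Abbreviation Marker or an End-of-Sentence Marker, and the text is split only at the latter.

```python
from typing import List

# Hypothetical, deliberately small abbreviation table (an assumption for
# illustration); a practical system would need a far larger list.
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "prof.", "e.g.", "i.e.", "etc."}


def classify_period(token: str) -> str:
    """Label a period-final token as an abbreviation or a sentence end."""
    if token.lower() in ABBREVIATIONS:
        return "ABBREVIATION_MARKER"
    return "END_OF_SENTENCE_MARKER"


def split_sentences(text: str) -> List[str]:
    """Split text into sentences at tokens classified as sentence ends."""
    sentences: List[str] = []
    current: List[str] = []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "?", "!")):
            # A period after a known abbreviation does not close the sentence.
            if token.endswith(".") and classify_period(token) == "ABBREVIATION_MARKER":
                continue
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that lacks final punctuation
        sentences.append(" ".join(current))
    return sentences


def tag_sentences(text: str) -> str:
    """Wrap each detected sentence in <SENT>...</SENT> tags."""
    return " ".join(f"<SENT>{s}</SENT>" for s in split_sentences(text))


if __name__ == "__main__":
    print(tag_sentences("I walked home with Ms. Smith. She ate breakfast."))
```

Because the sketch does not rely on capitalization, it also splits before a lowercase sentence start such as “virA encodes …”. It has clear limits: an abbreviation that also ends a sentence is never treated as a boundary, and punctuation followed by a closing quote or bracket is not recognized.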
References
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Text_segmentation#Sentence_segmentation Retrieved:2015-4-11.
- Sentence segmentation is the problem of dividing a string of written language into its component sentences. In English and some other languages, using punctuation, particularly the full stop character is a reasonable approximation. However even in English this problem is not trivial due to the use of the full stop character for abbreviations, which may or may not also terminate a sentence. For example Mr. is not its own sentence in "Mr. Smith went to the shops in Jones Street." When processing plain text, tables of abbreviations that contain periods can help prevent incorrect assignment of sentence boundaries.
As with word segmentation, not all written languages contain punctuation characters which are useful for approximating sentence boundaries.
1998
- (Mikheev, 1998) ⇒ Andrei Mikheev (1998). “Feature Lattices for Maximum Entropy Modelling.” In: ACL 36.
- Uses a Maximum Entropy approach.
- Attains ~99.25% accuracy.
1997
- (Palmer & Hearst, 1997) ⇒ David D Palmer, and Marti Hearst. (1997). “Adaptive Multilingual Sentence Boundary Disambiguation.” In: Computational Linguistics, 23.
- Casts it as a Classification Task over the sense of a Period: either an Abbreviation Marker or an End-of-Sentence Marker.
- (Reynar & Ratnaparkhi, 1997) ⇒ Jeffrey C Reynar, and Adwait Ratnaparkhi. (1997). “A Maximum Entropy Approach to Identifying Sentence Boundaries.” In: ANLP 5.
- Uses a Maximum Entropy-based approach.
1994
- (Palmer & Hearst, 1994) ⇒ David D Palmer, and Marti Hearst. (1994). “Adaptive Sentence Boundary Disambiguation.” In: ANLP 4.
- Casts it as a Classification Task over the sense of a Period: either an Abbreviation Marker or an End-of-Sentence Marker.
1989
- (Riley, 1989)
- In general, in English, about 90% of periods are sentence boundary indicators.