Sentence Boundary Detection (SBD) Task
A Sentence Boundary Detection (SBD) Task is a text segmentation task that requires the segmentation of a linguistic expression into its component natural language sentences.
- AKA: End-of-Sentence Detection, Sentence Splitting.
- Context:
- Input: a Text Item.
- Output: a Segmented Text Item demarcated by sentences.
- Measure: SBD Performance Measures, such as Precision and Recall, F1 Score, and Accuracy.
- It can range from being a Rule-based Sentence Boundary Detection Task to being a Data-driven Sentence Boundary Detection Task (such as a Supervised Sentence Boundary Detection Task).
- It can range from being a Written Sentence Boundary Detection Task to being a Spoken Sentence Boundary Detection Task.
- It can be solved by a Sentence Boundary Detection System (that implements a Sentence Boundary Detection Algorithm).
- It may involve challenges related to Punctuation Mark ambiguity, such as distinguishing between periods that denote sentence ends and those used in abbreviations or decimal points.
- Example(s):
- SBD("I saw E. coli under the microscope with Dr. Smith. They were moving.") ⇒
<SENT>I saw E. coli under the microscope with Dr. Smith.</SENT> <SENT>They were moving.</SENT>
- SBD("Dr. Jones, the principal, will speak at 5 p.m. today.") ⇒
<SENT>Dr. Jones, the principal, will speak at 5 p.m. today.</SENT>
- ...
- Counter-Example(s):
- See: Punctuation Mark, Full Stop, Abbreviation, Decimal Point, Ellipsis.
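The rule-based end of the range described above, together with the boundary-level measures, can be illustrated with a minimal Python sketch. It is not a reference implementation: the abbreviation table, the two rules, and the names `sentence_spans`, `split_sentences`, and `boundary_prf` are all assumptions introduced here for illustration. A period followed by whitespace is treated as a sentence end unless the preceding word is in a small abbreviation table or the next character is lowercase.

```python
import re

# Illustrative abbreviation table (an assumption; real systems use much larger lists).
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "st", "vs", "etc"}

# A candidate boundary is a run of '.', '!' or '?' followed by whitespace.
CANDIDATE = re.compile(r"[.!?]+\s+")


def sentence_spans(text):
    """Return (start, end) character offsets of each detected sentence."""
    spans = []
    start = 0
    for match in CANDIDATE.finditer(text):
        punct = match.group().rstrip()
        punct_end = match.start() + len(punct)
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        next_char = text[match.end():match.end() + 1]
        # Rule 1: a period after a listed abbreviation does not end the sentence.
        if punct == "." and last_word in ABBREVIATIONS:
            continue
        # Rule 2: a lowercase continuation suggests the sentence is still going.
        if next_char.islower():
            continue
        spans.append((start, punct_end))
        start = match.end()
    # Whatever remains after the last accepted boundary is the final sentence.
    if text[start:].strip():
        spans.append((start, len(text.rstrip())))
    return spans


def split_sentences(text):
    """Return the detected sentences as strings."""
    return [text[s:e].strip() for s, e in sentence_spans(text)]


def boundary_prf(predicted_ends, gold_ends):
    """Precision, recall and F1 over sets of sentence-end offsets."""
    predicted, gold = set(predicted_ends), set(gold_ends)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    example = "I saw E. coli under the microscope with Dr. Smith. They were moving."
    for sentence in split_sentences(example):
        print("<SENT>" + sentence + "</SENT>")
    predicted = {end for _, end in sentence_spans(example)}
    print(boundary_prf(predicted, {50, 68}))
```

On the two examples above, this sketch yields the segmentations shown: the period after "E." is rejected because "coli" is lowercase, and the period after "Dr." is rejected by the table. The rules are deliberately crude. An abbreviation that also ends a sentence (for instance "etc." followed by a capitalized word) is never split, which is one reason data-driven approaches are often preferred.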
References
2024
- (Wikipedia, 2024) ⇒ https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation Retrieved:2024-3-25.
- Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences; however, sentence boundary identification can be challenging due to the potential ambiguity of punctuation marks. In written English, a period may indicate the end of a sentence, or may denote an abbreviation, a decimal point, an ellipsis, or an email address, among other possibilities. About 47% of the periods in The Wall Street Journal corpus denote abbreviations. Question marks and exclamation marks can be similarly ambiguous due to use in emoticons, computer code, and slang.
Some languages including Japanese and Chinese have unambiguous sentence-ending markers.
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation Retrieved:2015-4-11.
- Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where sentences begin and end. Often natural language processing tools require their input to be divided into sentences for a number of reasons. ...
2011
- (Wikipedia, 2011) ⇒ http://en.wikipedia.org/wiki/Text_segmentation#Sentence_segmentation
- QUOTE: Sentence segmentation is the problem of dividing a string of written language into its component sentences. In English and some other languages, using punctuation, particularly the full stop character is a reasonable approximation. However even in English this problem is not trivial due to the use of the full stop character for abbreviations, which may or may not also terminate a sentence. For example Mr. is not its own sentence in "Mr. Smith went to the shops in Jones Street." When processing plain text, tables of abbreviations that contain periods can help prevent incorrect assignment of sentence boundaries.
As with word segmentation, not all written languages contain punctuation characters which are useful for approximating sentence boundaries.