Text Segmentation Task
A Text Segmentation Task is a text processing task and string segmentation task that requires annotating a text with its coherent text segments.
- AKA: TST, Text Partitioning Task, Text Division Task.
- Context:
- Input: Digital Text Items
- Output: Segmented Text Items
- Measure: Text Segmentation Performance Measures
- ...
- It can (typically) identify text boundaries
- It can (typically) mark coherent segments
- It can (typically) preserve text structures
- It can (often) maintain segment relations
- It can (often) handle overlapping segments
- ...
- It can range from being a Full Text Segmentation Task to being a Partial Text Segmentation Task, depending on its segmentation coverage
- It can range from being a Syntactic Text Chunking Task to being a Semantic Text Chunking Task, depending on its segmentation type
- It can range from being a Heuristic Text Segmentation Task to being a Data-Driven Text Segmentation Task, depending on its segmentation approach
- It can range from being a Language-Specific Text Segmentation Task to being a Language-Agnostic Text Segmentation Task, depending on its language dependency
- ...
- It can be solved by a Text Segmentation System (that implements a text segmentation algorithm)
- It can support Text Analysis Tasks
- It can maintain Segmentation History (for tracking)
- It can produce Segmentation Results (for evaluation)
- ...
- Examples:
- Syntactic Text Segmentation Tasks, such as:
- Semantic Text Segmentation Tasks, such as:
- Word and Subword Segmentation Tasks, such as:
- Text Word Segmentation Tasks with examples:
- Morph Segmentation Tasks with examples:
- ...
- ...
- Counter-Example(s):
- Handwritten Item Segmentation Tasks, which process physical text
- Software Statement Tokenization Tasks, which handle programming code
- Text Token Tagging Tasks, which assign labels rather than segments
- Speech Segmentation Tasks, which process audio
- DNA Segmentation Tasks, which process genetic sequences
- See: Text Segment, Linguistic Topic, Text Processing Task, Segmentation Task.
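The input/output contract listed under Context (digital text in, segmented text with identified boundaries out) can be illustrated with a minimal heuristic sketch. The function name `segment_sentences` and the punctuation rule are assumptions made for illustration only; the rule is deliberately naive and will mis-split abbreviations such as "Dr. Smith".

```python
import re
from typing import List, Tuple

# Assumed heuristic: a segment ends at ., ! or ? when followed by
# whitespace or the end of the text. Real systems need far richer rules.
_BOUNDARY = re.compile(r"[.!?]+(?=\s|$)")


def _trimmed_span(text: str, start: int, end: int) -> Tuple[int, int]:
    """Shrink [start, end) so it excludes surrounding whitespace."""
    piece = text[start:end]
    lead = len(piece) - len(piece.lstrip())
    return start + lead, start + lead + len(piece.strip())


def segment_sentences(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, segment) triples; offsets index into `text`."""
    segments = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        a, b = _trimmed_span(text, start, match.end())
        if a < b:
            segments.append((a, b, text[a:b]))
        start = match.end()
    # Keep any trailing text that has no closing punctuation.
    a, b = _trimmed_span(text, start, len(text))
    if a < b:
        segments.append((a, b, text[a:b]))
    return segments
```

Calling `segment_sentences("Text is split. Is it coherent? Yes")` yields three segments, each with character offsets that can be used to preserve the original text structure.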
References
2022
- (Wikipedia, 2022) ⇒ https://en.wikipedia.org/wiki/Text_segmentation Retrieved:2022-3-21.
- Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. The term applies both to mental processes used by humans when reading text, and to artificial processes implemented in computers, which are the subject of natural language processing. The problem is non-trivial, because while some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial and final letter shapes of Arabic, such signals are sometimes ambiguous and not present in all written languages.
Compare speech segmentation, the process of dividing speech into linguistically meaningful portions.
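The word-boundary problem described in the passage above (text without explicit word separators) is often introduced through greedy longest-match segmentation. The following is a minimal sketch of that general technique, not a method taken from the cited source; the function name `max_match` and the toy lexicon are hypothetical.

```python
from typing import List, Set


def max_match(text: str, lexicon: Set[str]) -> List[str]:
    """Greedy longest-match-first word segmentation.

    At each position, take the longest lexicon entry that matches;
    if nothing matches, emit a single character and move on.
    """
    max_len = max([len(word) for word in lexicon] + [1])
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # size == 1 is the fallback that guarantees progress.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon, for illustration only.
toy_lexicon = {"text", "segment", "segmentation", "task"}
print(max_match("textsegmentationtask", toy_lexicon))
# → ['text', 'segmentation', 'task']
```

Because the strategy is greedy, it can commit to a long match that makes the remainder unsegmentable, which is one reason the passage calls the problem non-trivial.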
2005
- (McDonald et al., 2005) ⇒ Ryan McDonald, Koby Crammer, and Fernando Pereira. (2005). “Flexible text segmentation with structured multilabel classification.” In: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT/EMNLP, 2005).
2000
- (McCallum et al., 2000) ⇒ Andrew McCallum, Dayne Freitag, and Fernando Pereira. (2000). “Maximum Entropy Markov Models for Information Extraction and Segmentation.” In: Proceedings of ICML-2000.
- (Choi, 2000) ⇒ Freddy Y. Y. Choi. (2000). “Advances in Domain Independent Linear Text Segmentation.” In: Proceedings of the 1st North American chapter of the Association for Computational Linguistics Conference.
1999
- (Beeferman et al., 1999) ⇒ Doug Beeferman, Adam Berger, and John D. Lafferty. (1999). “Statistical Models for Text Segmentation.” In: Machine Learning, 34(1–3).
- QUOTE: This paper introduces a new statistical approach to automatically partitioning text into coherent segments. ... Assessment of our approach on quantitative and qualitative grounds demonstrates its effectiveness in two very different domains, Wall Street Journal news articles and television broadcast news story transcripts. Quantitative results on these domains are presented using a new probabilistically motivated error metric, which combines precision and recall in a natural and flexible way. This metric is used to make a quantitative assessment of the relative contributions of the different feature types, as well as a comparison with decision trees and previously proposed text segmentation algorithms.
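The error metric mentioned in the quote is commonly computed in a fixed-window form usually written P_k: the fraction of position pairs k units apart on which a hypothesis and a reference disagree about whether the two positions lie in the same segment. The sketch below assumes that simplified form, with k defaulting to half the mean reference segment length as is conventional; the function names and the per-unit label encoding are illustrative choices, not taken from the paper.

```python
from typing import List, Optional, Sequence


def _segment_index(labels: Sequence) -> List[int]:
    """Renumber labels so each contiguous run gets its own index."""
    out, idx = [], 0
    for i, label in enumerate(labels):
        if i > 0 and label != labels[i - 1]:
            idx += 1
        out.append(idx)
    return out


def pk(reference: Sequence, hypothesis: Sequence,
       k: Optional[int] = None) -> float:
    """Fixed-window P_k over per-unit segment labels (lower is better)."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have equal length")
    if not reference:
        raise ValueError("sequences must be non-empty")
    ref = _segment_index(reference)
    hyp = _segment_index(hypothesis)
    n = len(ref)
    if k is None:
        # Half the mean reference segment length, at least 1.
        k = max(1, round(n / (ref[-1] + 1) / 2))
    if k >= n:
        raise ValueError("window k must be smaller than the sequence length")
    errors = 0
    for i in range(n - k):
        same_in_ref = ref[i] == ref[i + k]
        same_in_hyp = hyp[i] == hyp[i + k]
        errors += same_in_ref != same_in_hyp
    return errors / (n - k)
```

For example, with the reference `[0, 0, 0, 0, 1, 1, 1, 1]` and a hypothesis that places no boundary at all, two of the six windows straddle the missed boundary, so the score is 2/6.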
1988
- (Hobbs et al., 1988) ⇒ Jerry R. Hobbs, Mark Stickel, Paul Martin, and Douglas Edwards. (1988). “Interpretation as Abduction.” In: Proceedings of the 26th annual meeting on Association for Computational Linguistics (ACL 1988).