Speech Segmentation Task
A Speech Segmentation Task is a Segmentation Task for Spoken Expressions.
- Context:
- Input: a Spoken Expression.
- Output: a Sequence.
- It can be:
- Example(s):
- PWST(I'mcominghome) ⇒ ([I'm] [coming] [home]), an example of PWST.
- PWST(haʊtuɹɛkonaɪzbiːtʃ) ⇒ ([haʊ] [tu] [ɹɛk] [o] [naɪz] [biːtʃ]), i.e. “[how/haʊ][to/tu][wreck/ɹɛk][a/a][nice/naɪz][beach/biːtʃ]”.
- PWST(haʊtuɹɛkonaɪzbiːtʃ) ⇒ ([haʊ] [tu] [ɹɛkonaɪz] [biːtʃ]), i.e. “[how/haʊ][to/tu][recognize/ɹɛkəgnaɪz][speech/sbiːtʃ]”.
- See: Text Segmentation Task, Automatic Speech Recognition Task.
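The examples above show that one unsegmented input can admit more than one valid output sequence. A minimal sketch of that idea, assuming a small hypothetical lexicon (`TOY_LEXICON`) and a hypothetical helper `segmentations` that are not part of the task definition itself, enumerates every way to split an input into consecutive lexicon entries:

```python
def segmentations(utterance, lexicon):
    """Return every way to split `utterance` into consecutive lexicon entries."""
    results = []

    def extend(start, prefix):
        # A complete segmentation has consumed the whole utterance.
        if start == len(utterance):
            results.append(prefix)
            return
        # Try every candidate segment that begins at `start`.
        for end in range(start + 1, len(utterance) + 1):
            piece = utterance[start:end]
            if piece in lexicon:
                extend(end, prefix + [piece])

    extend(0, [])
    return results


# Hypothetical toy lexicon built from the segments in the examples above.
TOY_LEXICON = {"haʊ", "tu", "ɹɛk", "o", "naɪz", "biːtʃ", "ɹɛkonaɪz"}

for candidate in segmentations("haʊtuɹɛkonaɪzbiːtʃ", TOY_LEXICON):
    print(candidate)
```

With this toy lexicon the function finds both segmentations listed in the examples. A lexicon lookup alone cannot say which one the speaker intended; that choice needs additional information such as context.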
References
- (Wikipedia, 2009) ⇒ http://en.wikipedia.org/wiki/Speech_segmentation
- Speech segmentation is the process of identifying the boundaries between words, syllables, or phonemes in spoken natural languages. The term applies both to the mental processes used by humans, and to artificial processes of natural language processing.
- Speech segmentation is an important subproblem of speech recognition, and cannot be adequately solved in isolation. As in most natural language processing problems, one must take into account context, grammar, and semantics, and even so the result is often a probabilistic division rather than a categorical one.
- Phonetic segmentation: The lowest level of speech segmentation is the breakup and classification of the sound signal into a string of phones. The difficulty of this problem is compounded by the phenomenon of co-articulation of speech sounds, where one may be modified in various ways by the adjacent sounds: it may blend smoothly with them, fuse with them, split, or even disappear. This phenomenon may happen between adjacent words just as easily as within a single word.
- Lexical segmentation: In all natural languages, the meaning of a complex spoken sentence (which often has never been heard or uttered before) can be understood only by decomposing it into smaller lexical segments (roughly, the words of the language), associating a meaning to each segment, and then combining those meanings according to the grammar rules of the language. The recognition of each lexical segment in turn requires its decomposition into a sequence of discrete phonetic segments and mapping each segment to one element of a finite set of elementary sounds (roughly, the phonemes of the language); the meaning then can be found by standard table lookup algorithms.
- For most spoken languages, the boundaries between lexical units are surprisingly difficult to identify. One might expect that the inter-word spaces used by many written languages, like English or Spanish, would correspond to pauses in their spoken version; but that is true only in very slow speech, when the speaker deliberately inserts those pauses. In normal speech, one typically finds many consecutive words being said with no pauses between them, and often the final sounds of one word blend smoothly or fuse with the initial sounds of the next word.
- Moreover, an utterance can have different meanings depending on how it is split into words. A popular example, often quoted in the field, is the phrase "How to wreck a nice beach", which sounds very similar to "How to recognize speech". As this example shows, proper lexical segmentation depends on context and semantics, which draw on the whole of human knowledge and experience, and would thus require advanced pattern recognition and artificial intelligence technologies to be implemented on a computer.
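The passage above describes the result of segmentation as a probabilistic division that depends on context. One simple way to illustrate this, offered as a sketch rather than as the method any particular recognizer uses, is a unigram model: each candidate segment gets a probability, and dynamic programming picks the split with the highest total probability. The function `best_segmentation` and both probability tables are hypothetical, with values invented only to show how a change of context can flip the preferred reading.

```python
import math


def best_segmentation(utterance, segment_prob):
    """Return (log-probability, segments) of the most probable split.

    `segment_prob` maps each known segment to an assumed unigram probability.
    Returns (-inf, None) when no split into known segments exists.
    """
    # best[i] is the best (score, segments) found for utterance[:i].
    best = [(0.0, [])] + [(-math.inf, None)] * len(utterance)
    for end in range(1, len(utterance) + 1):
        for start in range(end):
            prev_score, prev_segments = best[start]
            piece = utterance[start:end]
            if prev_segments is None or piece not in segment_prob:
                continue
            score = prev_score + math.log(segment_prob[piece])
            if score > best[end][0]:
                best[end] = (score, prev_segments + [piece])
    return best[-1]


# Invented probabilities for two hypothetical conversational contexts.
BEACH_CONTEXT = {"haʊ": 0.15, "tu": 0.15, "ɹɛk": 0.2, "o": 0.2,
                 "naɪz": 0.2, "biːtʃ": 0.099, "ɹɛkonaɪz": 0.001}
SPEECH_CONTEXT = {"haʊ": 0.2, "tu": 0.2, "ɹɛk": 0.05, "o": 0.1,
                  "naɪz": 0.05, "biːtʃ": 0.1, "ɹɛkonaɪz": 0.3}

utterance = "haʊtuɹɛkonaɪzbiːtʃ"
for name, table in [("beach", BEACH_CONTEXT), ("speech", SPEECH_CONTEXT)]:
    score, segments = best_segmentation(utterance, table)
    print(name, segments, round(score, 2))
```

Under the first table the six-segment reading scores higher, and under the second the four-segment reading does. A unigram model ignores grammar and word order, so it captures only a small part of the contextual knowledge the passage refers to.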