Text Processing Task
A Text Processing Task is a data processing task whose input is a text dataset (of text items).
- AKA: Text Manipulation Task, Text Handling Task, TPT.
- Context:
- Input: Text Item
- optional: processing parameters
- optional: text format specification
- Output: Structured Text Document, Formatted Text, Annotated Text
- Measure: Text Processing Quality Measures
- ...
- It can (typically) require a Text Encoding-Decoding System
- It can (typically) manipulate character sequences
- It can (typically) preserve text structures
- It can (often) handle markup languages
- It can (often) support sequential access
- ...
- It can range from being an Offline Text Processing Task to being an Online Text Processing Task, depending on its processing mode
- It can range from being a Text Preprocessing Task to being a Text Post-Processing Task, depending on its processing stage
- It can range from being a Simple Text Processing Task to being a Complex Text Processing Task, depending on its processing complexity
- It can range from being a Format-Specific Processing Task to being a Format-Agnostic Processing Task, depending on its format dependency
- ...
- It can be solved by a Text Processing System (that implements a text processing algorithm)
- It can require a Text Encoding-Decoding Task
- It can maintain Processing History (for tracking)
- It can produce Processing Results (for evaluation)
- ...
- Example(s):
- Text Manipulation Tasks, such as:
- Document Processing Tasks, such as:
- Text Analysis Tasks, such as:
- ...
- Counter-Example(s):
- Source Code Processing Tasks, which handle programming languages
- Math Notation Processing Tasks, which process mathematical expressions
- Image Processing Tasks, which process visual data
- Speech Processing Tasks, which process audio data
- See: Text Entry Interface, Regular Expression Statement, Text Encoding-Decoding Task, Command-Line Interface (CLI), Graphical User Interface (GUI), PDF-to-Text Conversion Task.
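Several of the context items above (decoding and encoding, manipulating character sequences, sequential access, preserving text structure) can be illustrated together. The following is a minimal sketch, not a reference implementation: the function name `process_text`, the UTF-8 default, and the choice of whitespace collapsing as the manipulation are all assumptions made for illustration.

```python
import io


def process_text(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Decode bytes, collapse whitespace runs inside each line, re-encode.

    Hypothetical helper: the name and the UTF-8 default are assumptions.
    """
    text = raw.decode(encoding)              # encoding-decoding step
    out_lines = []
    for line in io.StringIO(text):           # sequential access, line by line
        # Character-sequence manipulation: collapse runs of whitespace
        # and trim the ends, while keeping one output line per input line.
        out_lines.append(" ".join(line.split()))
    return "\n".join(out_lines).encode(encoding)
```

For example, `process_text(b"a   b\t c\n  d  \n")` yields `b"a b c\nd"`: the line structure is preserved while the spacing inside each line is normalized.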
References
2020a
- (Wikipedia, 2020) ⇒ https://en.wikipedia.org/wiki/Text_processing Retrieved:2020-2-16.
- In computing, the term text processing refers to the theory and practice of automating the creation or manipulation of electronic text.
Text usually refers to all the alphanumeric characters specified on the keyboard of the person engaging the practice, but in general text means the abstraction layer immediately above the standard character encoding of the target text.
The term processing refers to automated (or mechanized) processing, as opposed to the same manipulation done manually.
Text processing involves computer commands which invoke content, content changes, and cursor movement, for example to
- search and replace
- format
- generate a processed report of the content of, or
- filter a file or report of a text file.
- The text processing of a regular expression is a virtual editing machine, having a primitive programming language that has named registers (identifiers), and named positions in the sequence of characters comprising the text. Using these the "text processor" can, for example, mark a region of text, and then move it. The text processing of a utility is a filter program, or filter. These two mechanisms comprise text processing.
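The operations named in the quoted passage (search and replace, filtering, and moving a marked region of text) can be sketched with Python's standard `re` module. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and treating named groups as the "named registers" of the quote is only a loose analogy.

```python
import re


def search_and_replace(text: str, pattern: str, replacement: str) -> str:
    """Replace every match of `pattern` in `text` (hypothetical helper)."""
    return re.sub(pattern, replacement, text)


def filter_lines(lines, pattern):
    """Act as a filter: keep only the lines that match `pattern`."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]


def swap_fields(text: str) -> str:
    """Mark two regions with named groups, then move them.

    The groups `key` and `value` stand in for named registers;
    this mapping is an assumption made for illustration.
    """
    return re.sub(r"(?P<key>\w+)=(?P<value>\w+)", r"\g<value>=\g<key>", text)
```

For example, `swap_fields("a=1 b=2")` returns `"1=a 2=b"`, and `filter_lines(["error: x", "ok", "error: y"], r"^error")` keeps only the two lines that start with `error`.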
2020b
- (Wikipedia, 2020) ⇒ https://en.wikipedia.org/wiki/Text_processing#Definition Retrieved:2020-2-16.
- Since the standardized markup such as ANSI escape codes are generally invisible to the editor, they comprise a set of transitory properties that become at times indistinguishable from word processing. But the definite distinctions from word processing are that text processing proper:
- represents "text processing utilities", not just "text editing" applications.
- is much more "the keyboard way", as opposed to "the mouse way" (e.g. drag and drop, cut and paste) of initiating an edit.
- is sequential access rather than random access in approach.
- operates directly at the presentation layer rather than indirectly at the application layer.
- works raw data that is standardized and works more openly rather than tending towards any proprietary methods.
- In this way markup such as font and color are not really a distinguishing factor, because the character sequences that affect font and color are simply standard characters inserted automatically by a background text processing mode, made to work transparently by compliant text editors, yet becoming otherwise visible as text processing commands when that mode is not in effect. So text processing is defined most basically (but not entirely) around the visual characters (or graphemes) rather than the standard, yet invisible characters.
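The point that color markup consists of ordinary characters inserted into the text can be made concrete. The sketch below assumes a terminal that interprets ANSI SGR (Select Graphic Rendition) sequences; the helper names are hypothetical, and the pattern covers only SGR sequences of the form `ESC [ ... m`, not every ANSI escape code.

```python
import re

# Matches SGR sequences only, e.g. "\x1b[31m" (red) or "\x1b[0m" (reset).
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def colorize_red(text: str) -> str:
    """Wrap `text` in the SGR codes for red foreground and reset."""
    return f"\x1b[31m{text}\x1b[0m"


def strip_ansi(text: str) -> str:
    """Remove SGR sequences, leaving only the visible characters."""
    return ANSI_SGR.sub("", text)
```

Here `colorize_red("hi")` is an 11-character string, although a compliant terminal displays only the two visible characters; `strip_ansi` recovers `"hi"` by deleting the escape sequences as plain text.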