Stopword
A Stopword is a word form that is excluded from some text processing task.
- AKA: Stop Word.
- Context:
- It can (typically) appear frequently in the Language.
- It can (typically) be a Function Word.
- It can (typically) be a Language-Specific Word (e.g. an English Stopword and a Hindi Stopword).
- It can (often) be excluded from being a Text Vector Feature.
- It can be a part of a Stopword List.
- It can be an input to a Stopword Removal Task.
- Example(s):
- “the”, “a”, “in”, “to” are typical English Stopwords.
- …
- Counter-Example(s):
- a Core Word.
- See: Inverse Document Frequency, Bag-of-Words Representation, Information Retrieval, Phrase Search, Search Engine, Function Word.
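A minimal sketch of a Stopword Removal Task feeding a bag-of-words feature count. The stopword set is just the four English examples listed above, and the function names `remove_stopwords` and `bag_of_words` are hypothetical; real stopword lists are longer and chosen per task.

```python
from collections import Counter

# Illustrative list only: the four English examples from this page.
STOPWORDS = {"the", "a", "in", "to"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not in the stopword list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]

def bag_of_words(text, stopwords=STOPWORDS):
    """Count the remaining terms after lowercasing and whitespace tokenization."""
    tokens = text.lower().split()
    return Counter(remove_stopwords(tokens, stopwords))

features = bag_of_words("The cat ran to the mat in a hurry")
```

Here `features` counts only "cat", "ran", "mat" and "hurry"; the stopwords never become text vector features.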
References
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Stop_words Retrieved:2015-2-21.
- In computing, stop words are words which are filtered out before or after processing of natural language data (text). There is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search. Any group of words can be chosen as the stop words for a given purpose. For some search engines, these are some of the most common, short function words, such as the, is, at, which, and on. In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as 'The Who', 'The The', or 'Take That'. Other search engines remove some of the most common words — including lexical words, such as "want" — from a query in order to improve performance. [1]
Hans Peter Luhn, one of the pioneers in information retrieval, is credited with coining the phrase and using the concept.
- ↑ Stackoverflow: "One of our major performance optimizations for the "related questions" query is removing the top 10,000 most common English dictionary words (as determined by Google search) before submitting the query to the SQL Server 2008 full text engine. It’s shocking how little is left of most posts once you remove the top 10k English dictionary words. This helps limit and narrow the returned results, which makes the query dramatically faster".
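The phrase-search problem described in this entry can be sketched as follows. The stopword set is the five function words named in the quote; the function `filter_query` and its `keep_if_empty` fallback are hypothetical illustrations, not the behavior of any particular search engine.

```python
# Function words named in the quote above; an illustrative list only.
STOPWORDS = {"the", "is", "at", "which", "on"}

def filter_query(query, stopwords=STOPWORDS, keep_if_empty=True):
    """Drop stopwords from a query.

    Assumption: if every token is a stopword, fall back to the original
    tokens so that a name such as 'The The' still yields a usable query.
    """
    tokens = query.lower().split()
    kept = [t for t in tokens if t not in stopwords]
    if not kept and keep_if_empty:
        return tokens
    return kept

partial = filter_query("The Who")                       # the article is lost
empty = filter_query("The The", keep_if_empty=False)    # nothing is left
fallback = filter_query("The The")                      # original tokens kept
```

Without the fallback, the band name "The The" produces an empty query, which is one reason some tools skip stopword removal when phrase search matters.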
2011
- (Wikipedia, 2011) ⇒ http://en.wikipedia.org/wiki/Stop_words
- QUOTE:In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). It is controlled by human input and not automated. There is not one definite list of stop words which all tools use, if even used. Some tools specifically avoid using them to support phrase search.
Any group of words can be chosen as the stop words for a given purpose. For some search machines, these are some of the most common, short function words, such as the, is, at, which and on. In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as 'The Who', 'The The', or 'Take That'. Other search engines remove some of the most common words — including lexical words, such as "want" — from query in order to improve performance.[1]
- ↑ Stackoverflow: "One of our major performance optimizations for the “related questions” query is removing the top 10,000 most common English dictionary words (as determined by Google search) before submitting the query to the SQL Server 2008 full text engine. It’s shocking how little is left of most posts once you remove the top 10k English dictionary words. This helps limit and narrow the returned results, which makes the query dramatically faster."
2005
- (ANSI Z39.19, 2005) ⇒ ANSI. (2005). “ANSI/NISO Z39.19 - Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies." ANSI.
- QUOTE: "stop list: A list of words considered to be of no value for retrieval. It consists primarily of function words — articles, conjunctions, and prepositions — but may also include words that occur very frequently in the literature of a domain."
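One way to read this definition is as a union of a fixed function-word list and terms that are very frequent in a domain collection. The sketch below assumes document frequency as the measure of "very frequently"; the threshold `min_doc_fraction`, the function name `frequent_terms`, and the sample documents are all hypothetical and are not taken from the standard.

```python
def frequent_terms(documents, min_doc_fraction=0.8):
    """Return terms that occur in at least min_doc_fraction of the documents."""
    n = len(documents)
    doc_freq = {}
    for doc in documents:
        # Count each term once per document (document frequency).
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t for t, df in doc_freq.items() if df / n >= min_doc_fraction}

# Small illustrative function-word list (articles, conjunctions, prepositions).
FUNCTION_WORDS = {"the", "a", "an", "and", "or", "of", "in", "to"}

# Made-up domain collection for illustration.
docs = [
    "the patient received a dose of aspirin",
    "the patient reported pain in the knee",
    "a patient with fever and cough",
]

domain_terms = frequent_terms(docs, min_doc_fraction=1.0)
stop_list = FUNCTION_WORDS | domain_terms
```

In this toy collection "patient" appears in every document, so it joins the stop list even though it is not a function word.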
2004
- (Kraft & Zien, 2004) ⇒ Reiner Kraft, and Jason Zien. (2004). “Mining anchor text for query refinement.” In: Proceedings of the 13th International Conference on World Wide Web (WWW 2004). 10.1145/988672.988763
- QUOTE:The window size was used as a parameter for our algorithms. For instance, we investigated the effect of counting stop words (see Appendix B for the list) or not for choosing good refinements by considering the window size parameter. For instance, suppose we choose a window of size two. ... We used the following stop words for anchor text:
ibm web site website websites link next topic domain prev previous page to the for and of an or not a click here - &