Noisy Dataset
A Noisy Dataset is a dataset whose data records contain measurement error (or measurement uncertainty).
- AKA: Uncertain Information.
- Context:
- It can range from being a Noisy Numerical Dataset to being a Noisy Nominal Dataset to being a Noisy Hybrid Dataset.
- It can range from being a Static Noisy Dataset to being a Temporal Noisy Dataset.
- Example(s):
- a Noisy Image Dataset, possibly with salt-and-pepper noise.
- a Noisy Text Dataset.
- …
- Counter-Example(s):
- See: Noise, Sensor Error, Data Scrubbing Task, Missing-not-at-Random Data, Kalman Filter.
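As an illustration of the Noisy Image Dataset example above, the following is a minimal sketch of how salt-and-pepper noise could be injected into a grayscale image. The function name `add_salt_and_pepper`, the `noise_fraction` parameter, and the list-of-rows image representation are assumptions made for this sketch and do not come from the cited sources.

```python
import random

def add_salt_and_pepper(image, noise_fraction, rng):
    """Return a noisy copy of a grayscale image (list of rows, values 0-255).

    Each pixel is independently replaced, with probability `noise_fraction`,
    by either 0 (pepper) or 255 (salt). The input image is not modified.
    """
    noisy = [row[:] for row in image]  # copy each row so the original stays clean
    for row in noisy:
        for c in range(len(row)):
            if rng.random() < noise_fraction:
                row[c] = rng.choice((0, 255))
    return noisy

# Hypothetical usage: a flat gray 8x8 image with about 20% of pixels corrupted.
rng = random.Random(0)
clean = [[128] * 8 for _ in range(8)]
noisy = add_salt_and_pepper(clean, 0.2, rng)
corrupted = sum(p != 128 for row in noisy for p in row)
print(f"{corrupted} of 64 pixels corrupted")
```

Passing the random generator in explicitly keeps the corruption reproducible, which helps when comparing denoising methods on the same noisy data.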
References
2017
- (Sammut & Webb, 2017) ⇒ Claude Sammut, and Geoffrey I. Webb. (2017). “Noise”. In: (Sammut & Webb, 2017) DOI:10.1007/978-1-4899-7687-1_957.
- QUOTE: The training data for a learning algorithm is said to be noisy if the data contain errors. Errors can be of two types:
- A measurement error occurs when some attribute values are incorrect or inaccurate. Note that measurement of physical properties by continuous values is always subject to some error.
- In supervised learning, classification error means that a training example has an incorrect class label.
- In addition to errors, training examples may have missing attribute values. That is, the values of some attribute values are not recorded.
Noisy data can cause learning algorithms to fail to converge to a concept description or to build a concept description that has poor classification accuracy on unseen examples. This is often due to overfitting.
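The three kinds of imperfection described in the quote above (measurement error in attribute values, incorrect class labels, and missing attribute values) can be simulated on a clean training set. The following is a rough sketch only: the function `corrupt_training_data`, its rate parameters, and the choice of Gaussian noise for measurement error are assumptions for illustration, not a method given by Sammut & Webb.

```python
import random

def corrupt_training_data(records, labels, classes, rng,
                          attr_sigma=0.1, label_flip_rate=0.1, missing_rate=0.05):
    """Return noisy copies of numeric records and their class labels.

    - Measurement error: Gaussian noise with std. dev. `attr_sigma` (an assumed noise model).
    - Classification error: each label is replaced by a different class
      with probability `label_flip_rate`.
    - Missing values: each attribute value becomes None with probability `missing_rate`.
    """
    noisy_records, noisy_labels = [], []
    for record, label in zip(records, labels):
        new_record = []
        for value in record:
            if rng.random() < missing_rate:
                new_record.append(None)  # value was not recorded
            else:
                new_record.append(value + rng.gauss(0.0, attr_sigma))
        if rng.random() < label_flip_rate:
            label = rng.choice([c for c in classes if c != label])
        noisy_records.append(new_record)
        noisy_labels.append(label)
    return noisy_records, noisy_labels

# Hypothetical usage on a tiny two-class dataset.
rng = random.Random(42)
records = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
labels = ["a", "a", "b", "b"]
noisy_records, noisy_labels = corrupt_training_data(records, labels, ["a", "b"], rng)
print(noisy_records, noisy_labels)
```

Training the same learner on the clean and the corrupted copies is one simple way to observe the loss of accuracy on unseen examples that the quote describes.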
2009
- (Hu et al., 2009) ⇒ Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou. (2009). “Exploiting Wikipedia as External Knowledge for Document Clustering.” In: Proceedings of ACM SIGKDD Conference (KDD-2009). DOI:10.1145/1557019.1557066
- … There are two major issues for this approach: (1) the coverage of the ontology is limited, even for WordNet or Mesh, (2) using ontology terms as replacement or additional features may cause information loss, or introduce noise.
2008
- (Wilson, 2008a) ⇒ Bill Wilson. (2008). “The Machine Learning Dictionary for COMP9414.” University of New South Wales, Australia.
- QUOTE: noisy data in machine learning: The term "noise" in this context refers to errors in the training data for machine learning algorithms. If a problem is difficult enough and complicated enough to be worth doing with machine learning techniques, then any reasonable training set is going to be large enough that there are likely to be errors in it. This will of course cause problems for the learning algorithm. See also: decision tree pruning and generalization in backprop.