Noisy Dataset

AKA: Uncertain Information.
Context:
- It can range from being a Noisy Numerical Dataset to being a Noisy Nominal Dataset to being a Noisy Hybrid Dataset.
- It can range from being a Static Noisy Dataset to being a Temporal Noisy Dataset.
Example(s):
- a Noisy Image Dataset, possibly with salt-and-pepper noise.
- a Noisy Text Dataset.
- …
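
As an illustration of the salt-and-pepper case listed above, the following is a minimal sketch of how such noise could be injected into a clean grayscale image. The function name `add_salt_and_pepper`, the `noise_fraction` parameter, and the list-of-rows image representation are assumptions made for this example and are not taken from any cited source.

```python
import random


def add_salt_and_pepper(image, noise_fraction=0.05, seed=0):
    """Return a noisy copy of a grayscale image.

    `image` is assumed to be a list of rows of ints in 0..255.
    Each pixel is independently replaced, with probability
    `noise_fraction`, by either 0 (pepper) or 255 (salt).
    """
    rng = random.Random(seed)           # local generator, reproducible
    noisy = [row[:] for row in image]   # copy so the input stays clean
    for row in noisy:
        for col in range(len(row)):
            if rng.random() < noise_fraction:
                row[col] = rng.choice((0, 255))
    return noisy


# Hypothetical 8x8 mid-gray image used only for demonstration.
clean = [[128] * 8 for _ in range(8)]
noisy = add_salt_and_pepper(clean, noise_fraction=0.2, seed=42)
corrupted = sum(p != 128 for row in noisy for p in row)
print(f"{corrupted} of 64 pixels were corrupted")
```

Pairing the clean and noisy versions of the same image in this way is one common way to obtain a noisy dataset with a known ground truth.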
Counter-Example(s):
- Clean Dataset.
- Representative Dataset.
See: Noise, Sensor Error, Data Scrubbing Task, Missing-not-at-Random Data, Kalman Filter.

References

In addition to errors, training examples may have missing attribute values. That is, the values of some attributes are not recorded.
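
A minimal sketch of how unrecorded attribute values might be quantified, assuming that each training example is a dictionary and that a missing value is encoded as `None` (or as an absent key). The helper name `missing_rate_by_attribute` and the sample records are hypothetical.

```python
def missing_rate_by_attribute(examples):
    """Return, per attribute, the fraction of examples with no value.

    A value counts as missing when the key is absent or maps to None.
    """
    attributes = sorted({name for ex in examples for name in ex})
    return {
        name: sum(ex.get(name) is None for ex in examples) / len(examples)
        for name in attributes
    }


# Hypothetical sensor readings; None marks an unrecorded value.
examples = [
    {"temperature": 21.5, "humidity": 0.40, "label": "ok"},
    {"temperature": None, "humidity": 0.42, "label": "ok"},
    {"temperature": 19.0, "humidity": None, "label": "fault"},
    {"temperature": None, "humidity": 0.39, "label": "ok"},
]
rates = missing_rate_by_attribute(examples)
print(rates)
```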

(Hu et al., 2009) ⇒ Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou. (2009). “Exploiting Wikipedia as External Knowledge for Document Clustering.” In: Proceedings of ACM SIGKDD Conference (KDD-2009). doi:10.1145/1557019.1557066
- … There are two major issues for this approach: (1) the coverage of the ontology is limited, even for WordNet or Mesh, (2) using ontology terms as replacement or additional features may cause information loss, or introduce noise.

(Wilson, 2008a) ⇒ Bill Wilson. (2008). “The Machine Learning Dictionary for COMP9414.” University of New South Wales, Australia.
- QUOTE: noisy data in machine learning: The term "noise" in this context refers to errors in the training data for machine learning algorithms. If a problem is difficult enough and complicated enough to be worth doing with machine learning techniques, then any reasonable training set is going to be large enough that there are likely to be errors in it. This will of course cause problems for the learning algorithm. See also: decision tree pruning and generalization in backprop.
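
The notion of errors in the training data described in the quote above can be simulated by flipping a fraction of the class labels. This is only a sketch under stated assumptions: the function name `flip_labels`, the `error_rate` parameter, and the uniform choice of a wrong class are illustrative and do not come from the cited dictionary.

```python
import random


def flip_labels(labels, error_rate, classes, seed=0):
    """Return a copy of `labels` in which each label is replaced,
    with probability `error_rate`, by a different class chosen
    uniformly at random from `classes`."""
    rng = random.Random(seed)
    noisy = []
    for label in labels:
        if rng.random() < error_rate:
            alternatives = [c for c in classes if c != label]
            noisy.append(rng.choice(alternatives))
        else:
            noisy.append(label)
    return noisy


# Hypothetical two-class training labels.
labels = ["spam", "ham"] * 50
noisy_labels = flip_labels(labels, error_rate=0.1,
                           classes=["spam", "ham"], seed=1)
observed_rate = sum(a != b for a, b in zip(labels, noisy_labels)) / len(labels)
print(f"observed label error rate: {observed_rate:.2f}")
```

Training the same learner on `labels` and on `noisy_labels` is one way to measure how sensitive it is to this kind of noise.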