Training Dataset
A Training Dataset is a learning dataset of training data records to be used in a supervised ML system.
- AKA: Training Examples.
- Context:
- It can (typically) consist of a Feature Space X and a Target Space y.
- It can be subdivided into a Growing Dataset and a Pruning Dataset.
- It can range from being a Labeled Training Dataset (with labeled training records) to being a Semi-Labeled Training Dataset (with some labeled training records) to being an Unlabeled Training Dataset (with unlabeled training records).
- It can be a task input to a Learning Task.
- It can be a task output of a Training Record Set Creation Task.
- It can (often) be created by a Predictive Modeling Data Preparation Task.
- It can range from being a Numerical Training Dataset to being a Categorical Training Dataset.
- It can range from being a Clean Training Set to being a Noisy Training Set.
- It can range from being a Small Training Set to being a Large Training Set, depending on its training set size.
- It can range from being an Unbiased Training Set to being a Biased Training Set.
- It can range from being Real-World Training Data to being Synthetic Training Data.
- It can be a Noisy Training Set, if it is a noisy dataset (with data noise).
- ...
- Example(s):
- a sklearn dataset such as:
- an Annotated Training Dataset (with annotated text items).
- …
- Counter-Example(s):
- See: Cross-Validation Task, Classifier, Overfitting, Dataset, Supervised Machine Learning System, Machine Learning, Training Time.
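The context items on a Feature Space X, a Target Space y, and the labeled / semi-labeled / unlabeled range can be illustrated with a minimal sketch. The record values, the use of `None` for a missing label, and the helper name `labeling_status` are assumptions made for illustration only.

```python
# Hypothetical toy records: each row of X is a feature vector,
# and y holds the matching target values (None marks a missing label).
X = [[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.9, 3.0]]
y = ["setosa", "setosa", "versicolor", None]


def labeling_status(targets):
    """Classify a training dataset by how many of its records carry a label."""
    labeled = sum(t is not None for t in targets)
    if labeled == 0:
        return "unlabeled"
    if labeled == len(targets):
        return "labeled"
    return "semi-labeled"


status = labeling_status(y)
print(len(X), "records,", status)
```

Here three of the four records carry a target, so the dataset falls in the semi-labeled part of the range.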
References
2018
- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Training,_test,_and_validation_sets#Training_dataset Retrieved:2018-4-1.
- A training dataset is a dataset of examples used for learning, that is to fit the parameters (e.g., weights) of, for example, a classifier.[1] [2]
Most approaches that search through training data for empirical relationships tend to overfit the data, meaning that they can identify apparent relationships in the training data that do not hold in general.
- ↑ Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press, p. 354
- ↑ "Subject: What are the population, sample, training set, design set, validation set, and test set?", Neural Network FAQ, part 1 of 7: Introduction (txt), comp.ai.neural-nets, Sarle, W.S., ed. (1997, last modified 2002-05-17)
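The overfitting remark above can be sketched with a learner that simply memorizes its training data. Everything in this example is an assumption for illustration: the labels are coin flips unrelated to the feature, and the learner is a one-nearest-neighbor rule written by hand.

```python
import random

rng = random.Random(0)


def make_records(n):
    # Hypothetical data: the label is a coin flip, unrelated to the feature.
    return [(rng.random(), rng.randint(0, 1)) for _ in range(n)]


def nearest_neighbor_label(training_set, x):
    # Memorizing learner: copy the label of the closest training record.
    _, label = min(training_set, key=lambda record: abs(record[0] - x))
    return label


def accuracy(training_set, records):
    hits = sum(nearest_neighbor_label(training_set, x) == y for x, y in records)
    return hits / len(records)


training_set = make_records(200)
fresh_records = make_records(200)

train_accuracy = accuracy(training_set, training_set)   # every record matches itself
fresh_accuracy = accuracy(training_set, fresh_records)  # no real relationship to find
print(train_accuracy, fresh_accuracy)
```

The learner reproduces every training label, yet that apparent relationship cannot carry over to fresh records, because the labels were generated independently of the feature.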
2017a
- (Sammut & Webb, 2017) ⇒ (2017) "Training Set". In: Sammut, C., Webb, G.I. (eds) "Encyclopedia of Machine Learning and Data Mining". Springer, Boston, MA
- QUOTE: A training set is a data set containing data that are used for learning by a learning system. A training set may be divided further into a growing set and a pruning set.
2017b
- (Sammut & Webb, 2017) ⇒ (2017) "Data Set". In: Sammut & Webb, 2017.
- QUOTE: A data set is a collection of data used for some specific machine learning purpose. A training set is a data set that is used as input to a learning system, which analyzes it to learn a model. A test set or evaluation set is a data set containing data that are used to evaluate the model learned by a learning system. A training set may be divided further into a growing set and a pruning set. Where the training set and the test set contain disjoint sets of data, the test set is known as a holdout set.
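The divisions named in this quote (training set versus holdout set, and growing set versus pruning set) can be sketched as two successive random splits. The split fractions, the integer record ids, and the helper name `split` are assumptions, not part of the source.

```python
import random


def split(records, fraction, rng):
    """Shuffle a copy of records and cut it into two disjoint lists."""
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


rng = random.Random(0)
records = list(range(20))  # hypothetical record ids

# First split: training set versus a disjoint test set (a holdout set).
training_set, holdout_set = split(records, 0.8, rng)

# Second split: the training set is divided into a growing and a pruning set.
growing_set, pruning_set = split(training_set, 0.75, rng)

print(len(growing_set), len(pruning_set), len(holdout_set))
```

Because each split partitions its input, the holdout set shares no records with the training set, which is the condition the quote gives for calling a test set a holdout set.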
2009
- (Wikipedia, 2009) ⇒ http://en.wikipedia.org/wiki/Training_set
- In artificial intelligence, a training set consists of an input vector and an answer vector, and is used together with a supervised learning method to train a knowledge database (e.g. a neural net or a naive bayes classifier) used by an AI machine.
- In general, the intelligent system consists of a function taking one or more arguments and results in an output vector, and the learning method's task is to run the system once with the input vector as the arguments, calculating the output vector, comparing it with the answer vector and then changing somewhat in order to get an output vector more like the answer vector next time the system is simulated.
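The loop described above (run the system on the input vector, compare the output with the answer vector, then adjust) can be sketched with a one-feature linear model and a delta-rule update. The model, the learning rate, the epoch count, and the toy data are all assumptions chosen for illustration, not the method of any particular system.

```python
def predict(weights, bias, x):
    """Run the system once: a linear function of the input vector."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def train(inputs, answers, learning_rate=0.1, epochs=200):
    weights = [0.0] * len(inputs[0])
    bias = 0.0
    for _ in range(epochs):
        for x, answer in zip(inputs, answers):
            # Compare the system's output with the answer ...
            error = answer - predict(weights, bias, x)
            # ... and change the parameters slightly to reduce the gap next time.
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical training set: answers follow y = 2x + 1 exactly.
inputs = [[0.0], [1.0], [2.0], [3.0]]
answers = [1.0, 3.0, 5.0, 7.0]

weights, bias = train(inputs, answers)
print(weights, bias)
```

With these noise-free answers the repeated small corrections bring the weight close to 2 and the bias close to 1.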
2008
- (Wick et al., 2008) ⇒ Michael Wick, Khashayar Rohanimanesh, Karl Schultz, and Andrew McCallum. (2008). “A Unified Approach for Schema Matching, Coreference, and Canonicalization.” In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2008).
- … In the training data we used the first two schemas and in the testing data we used one of the schemas from training... This way we train a model on one schema but test it on another schema. ...
2000
- (Evgeniou et al., 2000) ⇒ Theodorus Evgeniou, Massimiliano Pontil, and Tomaso Poggio. (2000). “Regularization Networks and Support Vector Machines.” In: Advances in Computational Mathematics, 13(1).
- … Vapnik’s theory characterizes and formalizes these concepts in terms of the capacity of a set of functions and capacity control depending on the training data: for instance, for a small training set the capacity of the function space in which [math]\displaystyle{ f }[/math] is sought has to be small whereas it can increase with a larger training set.
- … We are provided with examples of this probabilistic relationship, that is with a data set [math]\displaystyle{ D_l \equiv \{(x_i, y_i) \in X \times Y\}^l_{i=1} }[/math] called the training data, obtained by sampling [math]\displaystyle{ l }[/math] times the set [math]\displaystyle{ X \times Y }[/math] according to [math]\displaystyle{ P(x, y) }[/math].
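The construction of the training data as l samples drawn according to P(x, y) can be sketched as follows. The joint distribution used here is purely hypothetical: x is uniform on [0, 1] and the probability that y equals 1 is set to x.

```python
import random


def sample_training_data(l, rng):
    """Draw l independent pairs (x_i, y_i) from a hypothetical P(x, y)."""
    data = []
    for _ in range(l):
        x = rng.uniform(0.0, 1.0)                 # assumed marginal P(x)
        y = 1 if rng.random() < x else 0          # assumed P(y = 1 | x) = x
        data.append((x, y))
    return data


D_l = sample_training_data(l=100, rng=random.Random(42))
print(len(D_l), D_l[0])
```

Each pair is drawn independently from the same distribution, which is the sampling assumption the quoted passage makes about the training data.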