Training Dataset

A Training Dataset is a learning dataset of training data records to be used in a supervised ML system.

AKA: Training Examples.
Context:
- Task Input: Feature Vectors, Label Vectors
- Task Output: Trained Model
- Task Performance Measure: Dataset Quality Metrics such as:
- ...
- It can (typically) consist of a Feature Space X and a Target Space y.
- It can (typically) be a task input to a Learning Task.
- It can (typically) be a task output from a Training Record Set Creation Task.
- It can (often) be created by a Predictive Modeling Data Preparation Task.
- ...
- It can range from being a Labeled Training Dataset to being an Unlabeled Training Dataset, depending on its labeling status.
- It can range from being a Numerical Training Dataset to being a Categorical Training Dataset, depending on its data type.
- It can range from being a Clean Training Set to being a Noisy Training Set, depending on its data quality.
- It can range from being a Small Training Set to being a Large Training Set, depending on its training set size.
- It can range from being an Unbiased Training Set to being a Biased Training Set, depending on its data bias.
- It can range from being a Real-World Training Dataset to being a Synthetic Training Dataset, depending on its data source.
- It can range from being a Balanced Training Dataset to being an Imbalanced Training Dataset, depending on its class distribution.
- It can range from being a Static Training Dataset to being a Dynamic Training Dataset, depending on its update frequency.
- ...
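The feature-space/target-space structure and the class-distribution property listed above can be sketched in a few lines. This is a minimal illustration only: the toy records, the labels "a" and "b", and the helper names `class_distribution` and `is_labeled` are assumptions made for this example and do not come from any particular library.

```python
from collections import Counter

# Hypothetical toy training dataset: each record pairs a feature vector
# (a row of X) with a target label (an entry of y).
X = [
    [5.1, 3.5],
    [4.9, 3.0],
    [6.2, 2.9],
    [5.9, 3.0],
    [6.7, 3.1],
]
y = ["a", "a", "b", "b", "b"]

def class_distribution(labels):
    """Return the fraction of records that carry each class label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def is_labeled(labels):
    """Treat a dataset as labeled when every record has a non-None label."""
    return all(label is not None for label in labels)

dist = class_distribution(y)
print(dist, is_labeled(y))
```

A strongly skewed distribution would point toward an Imbalanced Training Dataset, and a `None` entry in `y` would point toward a partially Unlabeled Training Dataset.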
Examples:
- Standard ML Training Datasets, such as:
  - Tabular Datasets, such as:
    - sklearn Boston Dataset for regression tasks (removed from scikit-learn as of version 1.2).
    - sklearn Diabetes Dataset for regression tasks.
    - sklearn Iris Dataset for classification tasks.
  - Text Datasets, such as:
    - Annotated Training Datasets with annotated text items.
    - IMDB Dataset for sentiment analysis.
    - Reuters News Dataset for text classification.
    - Wiki Text Dataset for language modeling.
- Special Purpose Training Datasets, such as:
  - Domain Specific Datasets for specialized tasks.
  - Synthetic Datasets for rare events.
  - Healthcare Datasets for medical diagnosis.
  - Financial Datasets for fraud detection.
- Image Training Datasets, such as:
  - MNIST Dataset for digit recognition.
  - ImageNet Dataset for image classification and object detection.
  - CIFAR-10 Dataset for image classification.
- ...
Counter-Examples:
- Test Dataset, which is used for final model evaluation rather than model training.
- Validation Dataset, which is used for model selection and hyperparameter tuning rather than model training.
- Production Dataset, which is used for real-world inference after training is complete.
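The distinction between a training dataset and the test and validation datasets above is usually realized by partitioning one pool of records into disjoint parts. The following is a rough sketch of such a partition, assuming a 60/20/20 split and a fixed shuffle seed; the function name `split_dataset` and its parameters are hypothetical choices for this example.

```python
import random

def split_dataset(records, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle records and cut them into disjoint train/validation/test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]                      # used to fit model parameters
    validation = shuffled[n_train:n_train + n_val]  # used for model selection and tuning
    test = shuffled[n_train + n_val:]               # used once, for final evaluation
    return train, validation, test

records = list(range(10))  # stand-in for ten data records
train, validation, test = split_dataset(records)
print(len(train), len(validation), len(test))
```

Only the first part would be handed to the learning task; the other two stay untouched during training.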
See: Cross-Validation Task, Classifier, Overfitting, Dataset, Supervised Machine Learning System, Machine Learning, Training Time.

References

2018

(Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Training,_test,_and_validation_sets#Training_dataset Retrieved:2018-4-1.
- A training dataset is a dataset of examples used for learning, that is to fit the parameters (e.g., weights) of, for example, a classifier.^[1] ^[2]
  Most approaches that search through training data for empirical relationships tend to overfit the data, meaning that they can identify apparent relationships in the training data that do not hold in general.

[1] Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press, p. 354
[2] "Subject: What are the population, sample, training set, design set, validation set, and test set?", Neural Network FAQ, part 1 of 7: Introduction (txt), comp.ai.neural-nets, Sarle, W.S., ed. (1997, last modified 2002-05-17)
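The overfitting remark in the 2018 quote can be illustrated with a learner that simply memorizes its training data. The sketch below assumes a one-dimensional problem whose true rule is `x > 0.5`, with 20% of labels flipped as noise, and uses a 1-nearest-neighbour predictor; all of these choices, and the names `sample`, `predict_1nn`, and `accuracy`, are assumptions made for illustration.

```python
import random

rng = random.Random(0)

def sample(n, noise=0.2):
    """Draw n (x, label) pairs; the true rule is x > 0.5, with some labels flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        data.append((x, label))
    return data

def predict_1nn(train, x):
    """Predict the label of the closest training point (pure memorization)."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def accuracy(train, data):
    """Fraction of records in data whose label the 1-NN predictor reproduces."""
    hits = sum(predict_1nn(train, x) == label for x, label in data)
    return hits / len(data)

train = sample(50)
test = sample(200)
train_acc = accuracy(train, train)  # each training point is its own nearest neighbour
test_acc = accuracy(train, test)    # measured on records not seen during training
print(train_acc, test_acc)
```

Because the predictor reproduces even the flipped labels, its training accuracy is perfect, while accuracy on fresh records is typically lower; that gap is the kind of apparent relationship "that does not hold in general".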

2017a

(Sammut & Webb, 2017) ⇒ (2017) "Training Set". In: Sammut, C., Webb, G.I. (eds) "Encyclopedia of Machine Learning and Data Mining". Springer, Boston, MA
- QUOTE: A training set is a data set containing data that are used for learning by a learning system. A training set may be divided further into a growing set and a pruning set.

2017b

(Sammut & Webb, 2017) ⇒ (2017) "Data Set". In: Sammut & Webb, 2017.
- QUOTE: A data set is a collection of data used for some specific machine learning purpose. A training set is a data set that is used as input to a learning system, which analyzes it to learn a model. A test set or evaluation set is a data set containing data that are used to evaluate the model learned by a learning system. A training set may be divided further into a growing set and a pruning set. Where the training set and the test set contain disjoint sets of data, the test set is known as a holdout set.

2009

(Wikipedia, 2009) ⇒ http://en.wikipedia.org/wiki/Training_set
- In artificial intelligence, a training set consists of an input vector and an answer vector, and is used together with a supervised learning method to train a knowledge database (e.g. a neural net or a naive bayes classifier) used by an AI machine.
- In general, the intelligent system consists of a function taking one or more arguments and results in an output vector, and the learning method's task is to run the system once with the input vector as the arguments, calculating the output vector, comparing it with the answer vector and then changing somewhat in order to get an output vector more like the answer vector next time the system is simulated.
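The loop described in the 2009 quote (run the system on an input vector, compare the output with the answer, adjust slightly) can be sketched with a single linear unit trained by an error-correcting update. This is only one possible reading of that description: the toy data (generated from the assumed rule `answer = 2*x1 - x2`), the learning rate, the epoch count, and the names `run_system` and `train` are all hypothetical.

```python
# Hypothetical training set: input vectors and their answer values,
# generated from the assumed rule answer = 2*x1 - 1*x2.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
answers = [2.0, -1.0, 1.0, 3.0]

def run_system(weights, x):
    """The system under training: a weighted sum of the input vector."""
    return sum(w * xi for w, xi in zip(weights, x))

def train(inputs, answers, learning_rate=0.1, epochs=200):
    """Repeatedly run the system, compare with the answer, and nudge the weights."""
    weights = [0.0] * len(inputs[0])
    for _ in range(epochs):
        for x, answer in zip(inputs, answers):
            output = run_system(weights, x)
            error = answer - output
            # Change the weights slightly so the next output is closer to the answer.
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
    return weights

weights = train(inputs, answers)
print(weights)
```

On this noise-free toy data the weights approach the generating values 2 and -1; with noisy or inconsistent answers the same loop would only settle near a compromise.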

2008

(Wick et al., 2008) ⇒ Michael Wick, Khashayar Rohanimanesh, Karl Schultz, and Andrew McCallum. (2008). “A Unified Approach for Schema Matching, Coreference, and Canonicalization.” In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2008).
- … In the training data we used the first two schemas and in the testing data we used one of the schemas from training... This way we train a model on one schema but test it on another schema. ...

2000

(Evgeniou et al., 2000) ⇒ Theodorus Evgeniou, Massimiliano Pontil, and Tomaso Poggio. (2000). “Regularization Networks and Support Vector Machines.” In: Advances in Computational Mathematics, 13(1).
- … Vapnik’s theory characterizes and formalizes these concepts in terms of the capacity of a set of functions and capacity control depending on the training data: for instance, for a small training set the capacity of the function space in which [math]\displaystyle{ f }[/math] is sought has to be small whereas it can increase with a larger training set.
- … We are provided with examples of this probabilistic relationship, that is with a data set [math]\displaystyle{ D_l \equiv \{(x_i, y_i) \in X \times Y\}^l_{i=1} }[/math] called the training data, obtained by sampling [math]\displaystyle{ l }[/math] times the set [math]\displaystyle{ X \times Y }[/math] according to [math]\displaystyle{ P(x, y) }[/math].
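The sampling view of training data in the quote above can be made concrete with a small sketch that draws l pairs from a joint distribution. The specific P(x, y) used here (x uniform on [0, 1], y equal to 2x plus Gaussian noise) is an assumption chosen purely for illustration, as is the function name `sample_training_data`.

```python
import random

def sample_training_data(l, seed=0):
    """Draw l i.i.d. pairs (x_i, y_i) from an assumed joint distribution P(x, y)."""
    rng = random.Random(seed)
    data = []
    for _ in range(l):
        x = rng.uniform(0.0, 1.0)          # x ~ P(x), assumed uniform on [0, 1]
        y = 2.0 * x + rng.gauss(0.0, 0.1)  # y ~ P(y | x), assumed linear plus Gaussian noise
        data.append((x, y))
    return data

D_l = sample_training_data(l=5)
for x_i, y_i in D_l:
    print(round(x_i, 3), round(y_i, 3))
```

In practice P(x, y) is unknown and only the sampled pairs are available to the learner.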

