Predictive Modeling Data Preparation Task
A Predictive Modeling Data Preparation Task is a data preparation task that produces the datasets required by a predictive modeling task.
- Context:
- It can be solved by a Predictive Modeling Data Preparation System.
- It can (often) produce a Training Dataset, a Validation Dataset, and a Test Dataset.
- Example(s):
- AWS-based Predictive Modeling Data Preparation Task, e.g. using S3 and EMR.
- …
- Counter-Example(s):
- …
- See: Data Processing.
References
2018
- https://cloud.google.com/inclusive-ml/#data-guidelines
- QUOTE: Let’s start with the first step in AutoML: putting together your training data. While no training data will be perfectly “unbiased”, you can greatly improve your chances of building a better, more inclusive product if you carefully consider potential sources of bias in your data and take steps to address them.
2018
- https://cloud.google.com/vision/automl/docs/beginners-guide
- QUOTE: ... The data in a dataset is divided into three datasets when training a model: a training dataset, a validation dataset, and a test dataset.
- A training dataset is used to build a model. The model tries multiple algorithms and parameters while searching for patterns in the training data. As the model identifies patterns, it uses the validation dataset to test the algorithms and patterns. The best performing algorithms and patterns are chosen from those identified during the training stage.
- After the best performing algorithms and patterns have been identified, they are tested for error rate, quality, and accuracy using the test dataset.
- Both a validation and a test dataset are used in order to avoid bias in the model. During the validation stage, optimal model parameters are used, which can result in biased metrics. Using the test dataset to assess the quality of the model after the validation stage provides an unbiased assessment of the quality of the model.
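The three-way division described in the quote above can be illustrated with a minimal sketch. It is not taken from the AutoML documentation: the function name `split_dataset`, the 80/10/10 fractions, and the fixed random seed are assumptions chosen only for illustration, and a real task may need stratified or time-based splitting instead of a plain shuffle.

```python
import random


def split_dataset(records, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle records and split them into (train, validation, test) lists.

    Hypothetical helper for illustration; the fractions are assumptions,
    not values prescribed by any particular system.
    """
    if not (0 < train_frac < 1 and 0 <= val_frac < 1
            and train_frac + val_frac < 1):
        raise ValueError("fractions must leave a non-empty share for the test set")

    # Copy before shuffling so the caller's data is left untouched.
    shuffled = list(records)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    # Everything not used for training or validation is held out for testing.
    test = shuffled[n_train + n_val:]
    return train, validation, test


train, validation, test = split_dataset(range(100))
```

Because the three slices are taken from one shuffled copy, every record lands in exactly one dataset, which is what keeps the final test-set assessment separate from the data used for training and parameter selection.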