Splitting the dataset is essential for building reliable machine learning models. The process divides the dataset into training, validation, and test sets. This article introduces these three sets in detail, along with common data splitting techniques and pitfalls that are easy to fall into.
Training set
The training set is the data used to train the model, enabling it to learn the hidden features/patterns in the data.
In each epoch, the same training data is fed into the neural network repeatedly, and the model continues to learn the characteristics of the data.
The training set should contain a diverse set of inputs so that the model is trained on all scenarios and can generalize to future data samples.
Validation set
The validation set is a set of data, separate from the training set, used to evaluate model performance during training.
This validation process provides information that helps tune the model's hyperparameters and configuration. The model is trained on the training set and evaluated on the validation set after each epoch.
The main purpose of setting aside a validation set is to prevent overfitting, the situation where the model becomes very good at classifying samples in the training set but cannot generalize to and accurately classify data it has not seen before.
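To make the train-then-validate rhythm concrete, here is a minimal sketch, assuming scikit-learn, with SGDClassifier's partial_fit standing in for one epoch of any iteratively trained model (the dataset, epoch count, and model choice are illustrative assumptions):

```python
# Minimal per-epoch validation sketch (scikit-learn assumed).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = SGDClassifier(loss="log_loss", random_state=0)
classes = np.unique(y_train)

for epoch in range(10):
    model.partial_fit(X_train, y_train, classes=classes)  # one pass over the training set
    val_acc = model.score(X_val, y_val)                   # evaluate on held-out validation data
    print(f"epoch {epoch + 1}: validation accuracy = {val_acc:.3f}")
```

A rising training score paired with a stagnant or falling validation score is the usual early warning sign of overfitting.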
Test set
The test set is a separate set of data used to test the model after training is complete. It provides an unbiased final measure of model performance in terms of metrics such as accuracy and precision. Simply put, the test set reflects how well the model actually performs.
Creating these distinct samples and splits of the dataset helps judge the model's true performance. The split ratio depends on the number of samples in the dataset and on the model being trained.
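As one concrete illustration, a common choice is roughly 70/15/15 for train/validation/test. The sketch below, assuming scikit-learn, produces such a three-way split with two chained calls to train_test_split (the ratios are an example, not a rule):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve out the test set (15%), then split the remainder into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, random_state=0  # 15% of the original data
)

print(len(X_train), len(X_val), len(X_test))  # roughly 700 / 150 / 150
```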
Common rules of thumb for dataset splitting
If a model has many hyperparameters to tune, it needs a larger validation set to optimize performance reliably. Likewise, a model with few or no hyperparameters can be validated with a small set of data.
If wrong predictions would seriously harm the usefulness of the model in its use case, it is best to validate the model after each epoch so that it learns all scenarios well.
As the number of data dimensions/features grows, the neural network's hyperparameters also multiply, making the model more complex. In such cases, a large amount of data should be kept in the training set, along with a validation set.
Data splitting techniques
1. Random sampling
Random sampling is one of the oldest and most popular ways of partitioning a dataset. As the name suggests, the dataset is shuffled, and samples are randomly picked and placed into the training, validation, or test set according to the percentages given by the user.
However, this approach has an obvious drawback. Random sampling works best on class-balanced datasets, i.e., datasets with approximately the same number of samples in each class. On class-imbalanced datasets, this way of splitting the data may introduce bias.
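The sketch below, assuming scikit-learn and a made-up 90/10 class imbalance, shows plain random sampling and the drift it can cause in the minority-class share between splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90% class 0, 10% class 1 (illustrative assumption).
rng = np.random.default_rng(0)
y = np.array([0] * 900 + [1] * 100)
X = rng.normal(size=(1000, 5))

# Plain random sampling: shuffle, then split by percentage.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

# On imbalanced data, the minority-class share can differ between splits.
print(f"train minority share: {y_train.mean():.3f}, test minority share: {y_test.mean():.3f}")
```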
2. Stratified Sampling
Stratified sampling alleviates the problem random sampling has on datasets with imbalanced class distributions: the original class distribution is preserved in each of the training, validation, and test sets. It is a fairer way of splitting the data.
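With scikit-learn, stratification is a single argument. A minimal sketch on the same assumed 90/10 imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = np.array([0] * 900 + [1] * 100)  # same 90/10 imbalance as above
X = rng.normal(size=(1000, 5))

# stratify=y preserves the class distribution in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"train minority share: {y_train.mean():.3f}, test minority share: {y_test.mean():.3f}")  # both ~0.100
```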
3. Cross-validation
Cross-validation, or K-Fold cross-validation, is a more robust data splitting technique in which the model is trained and evaluated "K" times on different subsets of the data.
K-Fold cross-validation exposes the machine learning model to different data distributions, which mitigates, to some extent, the bias that can arise when selecting data for the training and validation sets. When using a K-Fold cross-validation scheme, it is common to report the mean and standard deviation of the metric across the folds.
That said, K-Fold cross-validation shares a problem with random sampling: the distribution of the data within each fold may be biased. Stratification can be used to preserve the class proportions while generating the "K" subsets, or folds, of the data.
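A minimal sketch of stratified K-Fold cross-validation, assuming scikit-learn; the 5 folds, the logistic regression model, and the imbalanced toy dataset are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Stratified K-Fold: each of the K folds keeps the original class distribution.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# Report the mean and standard deviation across the K evaluations.
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```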
Common pitfalls in data splitting
1. Using low-quality training data
Machine learning algorithms are sensitive to their training data: even small changes or errors in the training set can cause significant errors in model performance. The quality of the training data is therefore crucial.
2. Overfitting
Overfitting occurs when a machine learning model fails to generalize to unseen data. Noise or fluctuations in the training data are treated as features and learned by the model, resulting in a model that performs well on the training set but poorly on the validation and test sets.
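One quick way to see the symptom is to compare training and validation accuracy on a deliberately noisy dataset. The sketch below, assuming scikit-learn, fits an unconstrained decision tree, a model that readily memorizes noise:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 injects label noise so there is something spurious to memorize.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# An unconstrained tree memorizes the training data, noise included.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(f"train accuracy: {tree.score(X_train, y_train):.3f}")  # ~1.000
print(f"val accuracy:   {tree.score(X_val, y_val):.3f}")      # noticeably lower
```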
3. Overemphasis on validation and test set metrics
Validation metrics steer the training path: after each epoch, the machine learning model is evaluated on the validation set, the corresponding loss is computed, and hyperparameters are adjusted accordingly. Metrics should be chosen so that improving them has a positive impact on the overall trajectory of model performance; overemphasizing the raw validation and test numbers themselves can lead training astray.
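A common, disciplined way to let a validation metric guide training is early stopping: keep the checkpoint with the best validation score and stop once it stops improving. A minimal sketch, assuming scikit-learn; the patience value and model choice are illustrative assumptions:

```python
import copy

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = SGDClassifier(loss="log_loss", random_state=0)
classes = np.unique(y_train)
best_score, best_model, patience, stale = -np.inf, None, 3, 0

for epoch in range(50):
    model.partial_fit(X_train, y_train, classes=classes)
    score = model.score(X_val, y_val)  # the validation metric steers training
    if score > best_score:
        best_score, best_model, stale = score, copy.deepcopy(model), 0  # keep best checkpoint
    else:
        stale += 1
        if stale >= patience:  # no improvement for `patience` epochs: stop
            break

# best_model is the checkpoint to carry forward to the test set.
print(f"best validation accuracy: {best_score:.3f}")
```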