Overfitting
In statistics and machine learning, overfitting refers to an undesirable behavior in which an excessively complex model learns the random fluctuations and noise in its training data and treats them as if they were meaningful patterns.
Data scientists train machine learning models on known datasets and then use those models to make predictions. When overfitting occurs, a model performs well and gives accurate predictions on the training data but performs poorly on new data. In most cases, overfitting is caused by high variance: the model is so sensitive to the noise in its training data that it cannot generalize to previously unseen data.
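The gap between training accuracy and test accuracy is the practical signature of overfitting. Below is a minimal sketch, assuming scikit-learn and a synthetic dataset, of how an unconstrained decision tree can score almost perfectly on its training split while doing noticeably worse on held-out data; the dataset, tree depths, and seeds are illustrative, not part of the original example.

```python
# A minimal sketch of how the train/test accuracy gap reveals overfitting.
# Assumes scikit-learn and NumPy are installed; the dataset here is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification data stands in for a real training set.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree can memorize the training data, noise included.
deep_tree = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)
# A shallower tree is forced to capture only the broader patterns.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("deep tree", deep_tree), ("shallow tree", shallow_tree)]:
    print(name,
          "train:", round(model.score(X_train, y_train), 2),
          "test:", round(model.score(X_test, y_test), 2))
# The deep tree typically scores near 1.0 on the training split but noticeably
# lower on the test split -- the signature of overfitting.
```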
Causes of Overfitting
- A training dataset that is too small to accurately represent all of the input values the model will encounter
- Too much irrelevant information in the training data, often called noisy data
- A model that is too complex, so it learns the noise in the training data (a sketch follows this list)
- Prolonged training on a single, unvarying set of samples
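To make the first and third causes concrete, the following sketch (NumPy only, with all values illustrative) fits polynomials of two different degrees to a small, noisy sample: the low-degree fit captures the broad trend, while the high-degree fit threads through every noisy point and typically does worse on fresh data from the same source.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small, noisy sample drawn from a simple underlying trend (y = sin x).
x_train = np.linspace(0, 3, 10)
y_train = np.sin(x_train) + rng.normal(scale=0.2, size=x_train.size)

# Fresh data from the same trend, unseen during fitting.
x_test = np.linspace(0, 3, 50)
y_test = np.sin(x_test)

for degree in (2, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
# The degree-9 polynomial passes almost exactly through the 10 noisy training
# points, so its training error is near zero, but it tends to swing between
# them and usually shows a much larger error on the unseen test points.
```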
Example of Overfitting
Suppose a university experiencing a higher-than-normal dropout rate decides to build a model that predicts the likelihood that an applicant will complete the coursework required for graduation. The university trains the model on a dataset of 1,000 applicants and their outcomes. When run on that original dataset, the model predicts outcomes with 95% accuracy.
To test the model's accuracy, the university then runs it on a second dataset of 1,000 additional applicants. This time the model is only 40% accurate, because it fit too closely to a narrow subset of the data: the first 1,000 applicants.
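One common way to catch this kind of accuracy drop before deployment is to evaluate on data the model never trained on, for example with cross-validation. The snippet below is a generic scikit-learn sketch using synthetic stand-in data; it is not the university's actual pipeline, and the features, labels, and model choice are illustrative.

```python
# A generic sketch of how held-out evaluation exposes a gap like the one above.
# Assumes scikit-learn; the data and model are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for 1,000 applicant records (features plus a graduated yes/no label).
X, y = make_classification(n_samples=1000, n_features=15, flip_y=0.3, random_state=1)

model = RandomForestClassifier(random_state=1)

# Accuracy measured on the same data the model was trained on looks impressive...
train_accuracy = model.fit(X, y).score(X, y)

# ...but 5-fold cross-validation scores each fold on records the model never saw,
# giving a far more honest estimate of real-world accuracy.
cv_accuracy = cross_val_score(model, X, y, cv=5).mean()

print(f"training accuracy:        {train_accuracy:.2f}")
print(f"cross-validated accuracy: {cv_accuracy:.2f}")
```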
Related Terms
Underfitting
Underfitting results from bias errors introduced by a learning algorithm's overly simple assumptions. An underfit model lacks the complexity needed to capture the underlying trends in the data, so it oversimplifies the problem and performs poorly on both the training and test sets.
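As a minimal illustration, assuming scikit-learn and synthetic data, a plain linear model fit to clearly nonlinear data underfits: its scores are poor on the training set and the test set alike.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Data generated from a quadratic curve plus a little noise.
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A straight line cannot represent the curve, no matter how much data it sees.
model = LinearRegression().fit(X_train, y_train)
print("train R^2:", round(model.score(X_train, y_train), 2))
print("test R^2: ", round(model.score(X_test, y_test), 2))
# Both scores stay low (near zero here), the hallmark of underfitting:
# the model is too simple to capture the underlying trend.
```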
High Bias/Low Bias
refers to a model’s tendency to consistently make the same types of errors, even with different input data. High-bias models pay minimal attention to the training data and oversimplify the problem, which leads to poor performance on both the training and test datasets; underfitting and high bias are closely related concepts. Low bias, by contrast, occurs when a model fits the training data very closely.
High Variance
means that a model changes significantly in response to small changes in the training data. In machine learning, high variance indicates that a model is too sensitive to specific aspects of the training dataset, including outliers and noise.
Low Variance
in a machine learning model indicates that the model is relatively insensitive to the specifics of its training data. Models with low variance usually generalize better but can be prone to underfitting if they are overly simple.
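Variance can be observed directly by refitting a model on several resampled versions of the same training set and measuring how much its predictions move around. The sketch below, assuming scikit-learn and synthetic data, compares an unpruned regression tree (high variance) with a linear fit (low variance); the query point, sample sizes, and number of resamples are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# One underlying dataset; each round below trains on a bootstrap resample of it.
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=100)
x_query = np.array([[5.0]])  # a fixed point at which to compare predictions

def prediction_spread(model):
    """Std. dev. of the model's prediction at x_query across resampled fits."""
    preds = []
    for _ in range(50):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample
        preds.append(model.fit(X[idx], y[idx]).predict(x_query)[0])
    return np.std(preds)

# An unpruned tree reacts strongly to which noisy points it happens to see.
print("high variance (deep tree):", round(prediction_spread(DecisionTreeRegressor()), 3))
# A linear model barely changes between resamples (low variance, higher bias).
print("low variance (linear fit):", round(prediction_spread(LinearRegression()), 3))
```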
Poor Generalization
happens when a model fails to learn a dataset’s underlying patterns, resulting in poor performance on new data.
Deep Learning
is the type of machine learning most commonly used in natural language processing. Built from layers of neural networks, deep learning algorithms are loosely modeled on the workings of the human brain.
Transfer Learning
enables an already-trained deep neural network to be trained further for a new purpose. With transfer learning, deep neural networks can take on new tasks with far less computing effort and training data.
Pretrained Models
are trained on various combinations of datasets, languages, and pretraining tasks. Users can download pretrained models and fine-tune them for a wide array of target tasks.
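As a minimal sketch of both ideas, assuming PyTorch and torchvision are installed, the snippet below downloads an ImageNet-pretrained ResNet-18, freezes its layers, and attaches a new output layer for a hypothetical 10-class target task, so only that small layer needs training.

```python
# A minimal transfer-learning sketch, assuming PyTorch and torchvision.
# The pretrained backbone is reused; only a new output layer is trained
# for a hypothetical 10-class target task.
import torch
from torch import nn
from torchvision import models

# Download a network pretrained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained layers so their weights are not updated.
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification layer with one sized for the new task.
model.fc = nn.Linear(model.fc.in_features, 10)

# Only the new layer's parameters are passed to the optimizer, so fine-tuning
# needs far less compute and far less labeled data than training from scratch.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```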