
Overfitting

Written by Tiffany Clark
Reviewed by VidCruiter Editorial Team
Last Modified: Apr 17, 2024


In statistics and machine learning, the term overfitting refers to an excessively complex model that exhibits undesirable behavior: the model has learned the random fluctuations and noise in its training dataset and treats them as if they were meaningful patterns.

 

Data scientists train machine learning models on known datasets and then use the models to make predictions. When overfitting occurs, a model performs well and makes accurate predictions on its training data but performs poorly on new data. In most cases, overfitting is caused by high variance: the model is so sensitive to noise in the training data that it cannot generalize to previously unseen data.
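
To make the train/test gap concrete, here is a minimal sketch, assuming NumPy and scikit-learn are available. A high-degree polynomial (a high-variance model) is fit to a small, noisy training set: its training error is near zero, but its error on new data drawn from the same process is far worse.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    X_train = rng.uniform(-1, 1, size=(20, 1))                       # small training set
    y_train = np.sin(3 * X_train).ravel() + rng.normal(0, 0.3, 20)   # noisy labels
    X_test = rng.uniform(-1, 1, size=(200, 1))                       # previously unseen data
    y_test = np.sin(3 * X_test).ravel() + rng.normal(0, 0.3, 200)

    # A degree-15 polynomial has enough flexibility to memorize the noise.
    model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
    model.fit(X_train, y_train)

    print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))  # near zero
    print("test MSE: ", mean_squared_error(y_test, model.predict(X_test)))    # much larger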

 

Causes of Overfitting

 

  • A training dataset that is too small to represent all possible input values accurately (see the sketch after this list)

  • Too much irrelevant information in the training data, often called noisy data

  • High model complexity, which allows the model to learn the training data's noise

  • Training too long on a single sample dataset
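
A rough sketch of the first cause, assuming scikit-learn: the same unconstrained decision tree is trained on 20 samples and then on 2,000 samples of the same noisy classification problem. Both trees fit their training data perfectly, but only the model given enough data generalizes to the held-out test set.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # flip_y injects label noise, mimicking the "noisy data" cause above
    X, y = make_classification(n_samples=4000, n_features=20, flip_y=0.1, random_state=0)
    X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=2000, random_state=0)

    for n in (20, 2000):  # tiny vs. adequately sized training set
        tree = DecisionTreeClassifier(random_state=0).fit(X_pool[:n], y_pool[:n])
        print(f"n={n}: train accuracy {tree.score(X_pool[:n], y_pool[:n]):.2f}, "
              f"test accuracy {tree.score(X_test, y_test):.2f}")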

 

Example of Overfitting

 

Suppose a university is experiencing a higher-than-normal dropout rate, and officials decide to create a model that predicts the likelihood that each applicant will complete the coursework required for graduation. The university trains its model on a dataset of 1,000 applicants and their outcomes. Run against this original dataset, the model predicts outcomes with 95% accuracy.

 

To test the model's accuracy, the university then runs it on a second dataset of 1,000 additional applicants. On this second run, the model is only 40% accurate, because it fit too closely to a narrow data subset: the first 1,000 applicants.
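
A hedged sketch of this scenario, assuming scikit-learn and using synthetic stand-in data (the applicant features and graduation labels below are invented for illustration, so the exact accuracies will differ from the 95%/40% figures above). An unconstrained decision tree memorizes the first 1,000 applicants and then fares far worse on the next 1,000.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(1)
    # Hypothetical applicant features (e.g., GPA, test scores, credits attempted)
    X_first = rng.normal(size=(1000, 5))
    y_first = (rng.random(1000) < 0.7).astype(int)   # 1 = completed coursework
    X_next = rng.normal(size=(1000, 5))
    y_next = (rng.random(1000) < 0.7).astype(int)

    model = DecisionTreeClassifier(random_state=0).fit(X_first, y_first)
    print("first 1,000 applicants:", model.score(X_first, y_first))  # ~1.0 (memorized)
    print("next 1,000 applicants: ", model.score(X_next, y_next))    # far lower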

 

Related Terms

Underfitting

Underfitting results from bias errors, that is, erroneous assumptions built into the learning algorithm. An underfit model lacks the complexity needed to capture the underlying trends in the data, so it performs poorly on both the training and test sets. In effect, underfitting oversimplifies the problem and fails to capture the data's complexity.
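
For contrast with the overfitting sketch above, here is a minimal underfitting sketch, assuming scikit-learn: a straight line is too simple to capture a parabolic trend, so it scores poorly on the training data and the test data alike.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(2)
    X = rng.uniform(-3, 3, size=(300, 1))
    y = (X ** 2).ravel() + rng.normal(0, 0.1, 300)     # parabolic trend, mild noise

    line = LinearRegression().fit(X[:200], y[:200])    # a straight line cannot bend
    print("train R^2:", line.score(X[:200], y[:200]))  # low: model too simple
    print("test R^2: ", line.score(X[200:], y[200:]))  # similarly low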

High Bias/Low Bias

High bias refers to a model's tendency to consistently make the same types of errors even across different input data. Models with high bias pay minimal attention to the training data and oversimplify the problem, which leads to poor performance on both the training and test datasets; underfitting and high bias are closely related concepts. Low bias, by contrast, occurs when a model fits the training data very closely.

High Variance

means that a model changes significantly in response to small changes in its training data. In machine learning, high variance indicates that a model is too sensitive to the specifics of the training dataset, including its outliers and noise.

Low Variance

indicates that a model is relatively insensitive to the specifics of its training data. Models with low variance usually generalize better but are prone to underfitting if they are overly simple.
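
A rough sketch of the high/low variance distinction, assuming scikit-learn: two unconstrained decision trees trained on two different samples of the same process disagree sharply (high variance), while two depth-2 trees barely differ (low variance).

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(3)
    grid = np.linspace(-1, 1, 100).reshape(-1, 1)  # fixed points for comparing predictions

    def noisy_sample():
        X = rng.uniform(-1, 1, size=(40, 1))
        return X, np.sin(3 * X).ravel() + rng.normal(0, 0.3, 40)

    for depth in (None, 2):  # unconstrained (high variance) vs. shallow (low variance)
        preds = []
        for _ in range(2):
            X, y = noisy_sample()
            tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X, y)
            preds.append(tree.predict(grid))
        gap = np.mean(np.abs(preds[0] - preds[1]))
        print(f"max_depth={depth}: mean disagreement between retrained models {gap:.3f}")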

Poor Generalization

happens when models fail to learn a dataset’s underlying patterns, which results in poor performance on new data.

Deep Learning

is the most commonly used type of machine learning in natural language processing. Powered by layers of neural networks, deep learning algorithms are loosely modeled on the workings of the human brain.
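
As a minimal illustration of stacked neural-network layers, here is a tiny fully connected network, assuming PyTorch is installed; real deep learning models for natural language processing are far larger and use specialized layer types.

    import torch
    from torch import nn

    model = nn.Sequential(      # layers are applied in order
        nn.Linear(100, 64),     # 100 input features -> hidden layer
        nn.ReLU(),              # nonlinearity between layers
        nn.Linear(64, 64),
        nn.ReLU(),
        nn.Linear(64, 2),       # two output classes
    )
    logits = model(torch.randn(8, 100))  # a batch of 8 examples
    print(logits.shape)                  # torch.Size([8, 2])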

Transfer Learning

enables further training of already-trained deep neural networks. With transfer learning, a deep neural network can take on new tasks with far less computing effort and training data than training from scratch would require.
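
A hedged sketch of transfer learning, assuming PyTorch and torchvision are installed; the five-class head below is a made-up target task. A network pretrained on ImageNet is frozen, and only a new classification head is trained, which is why far less data and compute are needed.

    import torch
    from torch import nn
    from torchvision import models

    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # ImageNet weights
    for param in backbone.parameters():
        param.requires_grad = False  # freeze the pretrained layers

    # Replace the final layer with a new head for a hypothetical 5-class task.
    backbone.fc = nn.Linear(backbone.fc.in_features, 5)

    # Only the new head's parameters are updated during training.
    optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)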

Pretrained Models

are models trained in advance on various combinations of datasets, languages, and pre-training tasks. Users can download pretrained models and fine-tune them for a wide array of target tasks.
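
A hedged sketch of downloading and preparing a pretrained model for fine-tuning, assuming the Hugging Face transformers library is installed; "bert-base-uncased" is just one widely available checkpoint, and the two-label setup is a made-up target task.

    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2  # new classification head for the target task
    )

    inputs = tokenizer("A sample sentence to classify.", return_tensors="pt")
    outputs = model(**inputs)    # forward pass reusing the pretrained weights
    print(outputs.logits.shape)  # torch.Size([1, 2])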

