
Overfitting

Written by Tiffany Clark
Reviewed by VidCruiter Editorial Team
Last Modified: Apr 17, 2024


In statistics and machine learning, the term overfitting refers to an excessively complex model that exhibits undesirable behavior: the model has learned the random fluctuations and noise in its training dataset and treats them as if they were meaningful patterns.

 

Data scientists train machine learning models on known datasets and then use the models to make predictions. When overfitting occurs, a model performs well and makes accurate predictions on its training data but performs poorly on new data. In most cases, overfitting is caused by high variance: the model is so sensitive to noise in the training data that it cannot generalize to previously unseen data.
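
To make the train/test gap concrete, here is a minimal sketch, assuming NumPy and scikit-learn are available. A high-degree polynomial (a high-variance model) is fit to a small, noisy training set: its training error is near zero, but its error on new data drawn from the same process is far worse.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    X_train = rng.uniform(-1, 1, size=(20, 1))                       # small training set
    y_train = np.sin(3 * X_train).ravel() + rng.normal(0, 0.3, 20)   # noisy labels
    X_test = rng.uniform(-1, 1, size=(200, 1))                       # previously unseen data
    y_test = np.sin(3 * X_test).ravel() + rng.normal(0, 0.3, 200)

    # A degree-15 polynomial has enough flexibility to memorize the noise.
    model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
    model.fit(X_train, y_train)

    print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))  # near zero
    print("test MSE: ", mean_squared_error(y_test, model.predict(X_test)))    # much larger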

 

Causes of Overfitting

 

  • A training dataset that is too small to represent all possible input values accurately (see the sketch after this list)

  • Too much irrelevant information in the training data, often called noisy data

  • High model complexity, which allows the model to learn the training data's noise

  • Training too long on a single sample dataset
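
A rough sketch of the first cause, assuming scikit-learn: the same unconstrained decision tree is trained on 20 samples and then on 2,000 samples of the same noisy classification problem. Both trees fit their training data perfectly, but only the model given enough data generalizes to the held-out test set.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # flip_y injects label noise, mimicking the "noisy data" cause above
    X, y = make_classification(n_samples=4000, n_features=20, flip_y=0.1, random_state=0)
    X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=2000, random_state=0)

    for n in (20, 2000):  # tiny vs. adequately sized training set
        tree = DecisionTreeClassifier(random_state=0).fit(X_pool[:n], y_pool[:n])
        print(f"n={n}: train accuracy {tree.score(X_pool[:n], y_pool[:n]):.2f}, "
              f"test accuracy {tree.score(X_test, y_test):.2f}")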

 

Example of Overfitting

 

Suppose a university is experiencing a higher-than-normal dropout rate, and officials decide to create a model that predicts the likelihood that each applicant will complete the coursework required for graduation. The university trains its model on a dataset of 1,000 applicants and their outcomes. Run against this original dataset, the model predicts outcomes with 95% accuracy.

 

To test the model's accuracy, the university then runs it on a second dataset of 1,000 additional applicants. On this second run, the model is only 40% accurate, because it fit too closely to a narrow data subset: the first 1,000 applicants.
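
A hedged sketch of this scenario, assuming scikit-learn and using synthetic stand-in data (the applicant features and graduation labels below are invented for illustration, so the exact accuracies will differ from the 95%/40% figures above). An unconstrained decision tree memorizes the first 1,000 applicants and then fares far worse on the next 1,000.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(1)
    # Hypothetical applicant features (e.g., GPA, test scores, credits attempted)
    X_first = rng.normal(size=(1000, 5))
    y_first = (rng.random(1000) < 0.7).astype(int)   # 1 = completed coursework
    X_next = rng.normal(size=(1000, 5))
    y_next = (rng.random(1000) < 0.7).astype(int)

    model = DecisionTreeClassifier(random_state=0).fit(X_first, y_first)
    print("first 1,000 applicants:", model.score(X_first, y_first))  # ~1.0 (memorized)
    print("next 1,000 applicants: ", model.score(X_next, y_next))    # far lower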

 

Related Terms

Underfitting

Underfitting results from bias errors, that is, erroneous assumptions built into the learning algorithm. An underfit model lacks the complexity needed to capture the underlying trends in the data, so it performs poorly on both the training and test sets. In effect, underfitting oversimplifies the problem and fails to capture the data's complexity.
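
For contrast with the overfitting sketch above, here is a minimal underfitting sketch, assuming scikit-learn: a straight line is too simple to capture a parabolic trend, so it scores poorly on the training data and the test data alike.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(2)
    X = rng.uniform(-3, 3, size=(300, 1))
    y = (X ** 2).ravel() + rng.normal(0, 0.1, 300)     # parabolic trend, mild noise

    line = LinearRegression().fit(X[:200], y[:200])    # a straight line cannot bend
    print("train R^2:", line.score(X[:200], y[:200]))  # low: model too simple
    print("test R^2: ", line.score(X[200:], y[200:]))  # similarly low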

High Bias/Low Bias

High bias refers to a model's tendency to consistently make the same types of errors even across different input data. Models with high bias pay minimal attention to the training data and oversimplify the problem, which leads to poor performance on both the training and test datasets; underfitting and high bias are closely related concepts. Low bias, by contrast, occurs when a model fits the training data very closely.

High Variance

means that a model changes significantly in response to small changes in its training data. In machine learning, high variance indicates that a model is too sensitive to the specifics of the training dataset, including its outliers and noise.

Low Variance

indicates that a model is relatively insensitive to the specifics of its training data. Models with low variance usually generalize better but are prone to underfitting if they are overly simple.
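
A rough sketch of the high/low variance distinction, assuming scikit-learn: two unconstrained decision trees trained on two different samples of the same process disagree sharply (high variance), while two depth-2 trees barely differ (low variance).

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(3)
    grid = np.linspace(-1, 1, 100).reshape(-1, 1)  # fixed points for comparing predictions

    def noisy_sample():
        X = rng.uniform(-1, 1, size=(40, 1))
        return X, np.sin(3 * X).ravel() + rng.normal(0, 0.3, 40)

    for depth in (None, 2):  # unconstrained (high variance) vs. shallow (low variance)
        preds = []
        for _ in range(2):
            X, y = noisy_sample()
            tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X, y)
            preds.append(tree.predict(grid))
        gap = np.mean(np.abs(preds[0] - preds[1]))
        print(f"max_depth={depth}: mean disagreement between retrained models {gap:.3f}")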

Poor Generalization

happens when models fail to learn a dataset’s underlying patterns, which results in poor performance on new data.

Deep Learning

is the most commonly used type of machine learning in natural language processing. Powered by layers of neural networks, deep learning algorithms are loosely modeled on the workings of the human brain.
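
As a minimal illustration of stacked neural-network layers, here is a tiny fully connected network, assuming PyTorch is installed; real deep learning models for natural language processing are far larger and use specialized layer types.

    import torch
    from torch import nn

    model = nn.Sequential(      # layers are applied in order
        nn.Linear(100, 64),     # 100 input features -> hidden layer
        nn.ReLU(),              # nonlinearity between layers
        nn.Linear(64, 64),
        nn.ReLU(),
        nn.Linear(64, 2),       # two output classes
    )
    logits = model(torch.randn(8, 100))  # a batch of 8 examples
    print(logits.shape)                  # torch.Size([8, 2])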

Transfer Learning

enables further training of already-trained deep neural networks. With transfer learning, a deep neural network can take on new tasks with far less computing effort and training data than training from scratch would require.
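
A hedged sketch of transfer learning, assuming PyTorch and torchvision are installed; the five-class head below is a made-up target task. A network pretrained on ImageNet is frozen, and only a new classification head is trained, which is why far less data and compute are needed.

    import torch
    from torch import nn
    from torchvision import models

    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # ImageNet weights
    for param in backbone.parameters():
        param.requires_grad = False  # freeze the pretrained layers

    # Replace the final layer with a new head for a hypothetical 5-class task.
    backbone.fc = nn.Linear(backbone.fc.in_features, 5)

    # Only the new head's parameters are updated during training.
    optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)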

Pretrained Models

are models trained in advance on various combinations of datasets, languages, and pre-training tasks. Users can download pretrained models and fine-tune them for a wide array of target tasks.
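
A hedged sketch of downloading and preparing a pretrained model for fine-tuning, assuming the Hugging Face transformers library is installed; "bert-base-uncased" is just one widely available checkpoint, and the two-label setup is a made-up target task.

    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2  # new classification head for the target task
    )

    inputs = tokenizer("A sample sentence to classify.", return_tensors="pt")
    outputs = model(**inputs)    # forward pass reusing the pretrained weights
    print(outputs.logits.shape)  # torch.Size([1, 2])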

