Over-Fitting vs. Under-Fitting

Saurabh Gupta · Published in Analytics Vidhya · 3 min read · Mar 26, 2021

Let’s start by discussing some terminology.

Bias: loosely speaking, the error on the training data.

Variance: loosely speaking, the error on the test data.

Over-Fitting: the algorithm fits the training data well but not the test data, i.e., low bias and high variance.

Under-Fitting: the algorithm fits neither the training data nor the test data well, i.e., high bias and high variance.

Now that we know what over-fitting and under-fitting are, let’s discuss what to do when we run into them.

Over-Fitting

1. Try a regularized model.

Regularized regression is a type of regression in which the coefficient estimates are constrained (shrunk) toward zero. A penalty on the magnitude of the coefficients is added to the usual error term, discouraging overly complex models.

Lasso regression: lasso performs L1 regularization, which adds a penalty equal to the absolute value of the magnitude of the coefficients. This type of regularization can lead to sparse models with fewer coefficients; some coefficients can shrink to exactly zero and be eliminated from the model.

Ridge regression: ridge uses L2 regularization, which adds a penalty equal to the square of the magnitude of the coefficients. Unlike the L1 penalty, this shrinks all coefficients toward zero but never sets any of them exactly to zero, so no features are eliminated.
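To make this concrete, here is a minimal sketch using scikit-learn (the library, dataset, and alpha values are my assumptions for illustration, not something the article prescribes) that fits a lasso and a ridge model and compares their coefficients:

```python
# A minimal sketch of lasso (L1) vs. ridge (L2) regularization using
# scikit-learn; the dataset and alpha values are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1: penalty = alpha * sum(|w|)
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: penalty = alpha * sum(w**2)

# Lasso drives uninformative coefficients to exactly zero;
# ridge only shrinks them toward zero.
print("Lasso coefficients:", lasso.coef_.round(2))
print("Ridge coefficients:", ridge.coef_.round(2))
```

Printing both coefficient vectors shows the key difference in practice: the lasso zeroes out the uninformative features, while the ridge keeps all ten and merely shrinks them.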

2. Try hyper-parameter tuning.

Hyperparameter tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process.
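As an illustration, here is a hedged sketch of tuning a decision tree with scikit-learn’s GridSearchCV; the model choice and the parameter grid are assumptions made for the example:

```python
# A minimal hyper-parameter tuning sketch with scikit-learn's GridSearchCV;
# the model and parameter grid are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"max_depth": [2, 4, 6, None],
              "min_samples_leaf": [1, 5, 20]}

# 5-fold cross-validated search over every grid combination.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)

print("Best hyper-parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```

Limiting hyper-parameters such as tree depth directly caps model complexity, which is why tuning them helps against over-fitting.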

3. Prune tree-based models.

Pruning is one of the methods used to overcome the over-fitting problem. In its literal sense, pruning is the selective removal of certain parts of a tree or plant, such as branches, shoots, or roots, to shape it and promote healthy growth. In machine learning the idea is the same: branches of a decision tree that add complexity without improving predictive power are removed, which reduces over-fitting.
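As one concrete (assumed) way to do this, scikit-learn supports cost-complexity pruning through the ccp_alpha parameter; the dataset and the alpha value below are illustrative:

```python
# A sketch of cost-complexity pruning for a decision tree in scikit-learn;
# the dataset and chosen ccp_alpha are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(random_state=0,
                                ccp_alpha=0.01).fit(X_train, y_train)

# The pruned tree is smaller and usually generalizes better.
print("Unpruned:", unpruned.get_n_leaves(), "leaves,",
      "test acc", round(unpruned.score(X_test, y_test), 3))
print("Pruned:  ", pruned.get_n_leaves(), "leaves,",
      "test acc", round(pruned.score(X_test, y_test), 3))
```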

4. Use cross-validation.

5. Try simpler models.

6. Try getting more training data.

7. Use early stopping if the algorithm supports it (a sketch of points 4 and 7 follows this list).
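Here is a hedged sketch of points 4 and 7; scikit-learn and the gradient-boosting model are my assumptions for the example:

```python
# A sketch of cross-validation and early stopping with scikit-learn;
# the dataset and model choices are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 4. Cross-validation: estimate generalization over 5 folds instead of
#    trusting a single train/test split.
model = GradientBoostingClassifier(random_state=0)
scores = cross_val_score(model, X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# 7. Early stopping: hold out 10% of the training data and stop adding
#    trees once the validation score stops improving.
early = GradientBoostingClassifier(n_estimators=1000,
                                   validation_fraction=0.1,
                                   n_iter_no_change=10, random_state=0)
early.fit(X, y)
print("Trees actually fit:", early.n_estimators_)
```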

Under-Fitting

  1. Increase the complexity of the model.
  2. Reduce regularization.
  3. Increase the number of training iterations (a short sketch of all three follows this list).
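As an assumed illustration of these three fixes (the synthetic data, polynomial features, and ridge model are my choices, not the article’s):

```python
# A sketch of fighting under-fitting: add model complexity (polynomial
# features), reduce regularization (smaller alpha), allow more iterations.
# The data and parameter values are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Ridge, SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# 1. A straight line under-fits a sine wave; polynomial features add
#    the complexity needed to follow the curve.
linear = Ridge(alpha=1.0).fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=5),
                     Ridge(alpha=0.01)).fit(X, y)   # 2. weaker penalty
print("Linear R^2:", round(linear.score(X, y), 3))
print("Poly   R^2:", round(poly.score(X, y), 3))

# 3. Iterative learners may simply need more passes over the data.
sgd = SGDRegressor(max_iter=5000, random_state=0).fit(X, y)
```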

Imbalanced Classes

Imbalanced data is another big issue. A model can show high accuracy even while performing poorly on the minority class. Here are some ways to handle an imbalanced dataset.

I have explained accuracy, precision, and recall in detail; please check out my article on them. LINK

  1. Over-sampling: increasing the number of rows for the class with a small frequency. Actually collecting more data for the smaller class can be difficult, so duplicating its existing rows is a common shortcut.
  2. Under-sampling: decreasing the number of rows for the larger class to make training more balanced.
  3. Generating data with SMOTE: the Synthetic Minority Over-sampling Technique generates new synthetic data points for the smaller class.
  4. Class weights: some algorithms let you assign different weights to each class to improve training (points 3 and 4 are sketched after this list).
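As a hedged sketch of points 3 and 4, assuming the imbalanced-learn package for SMOTE (the article doesn’t name a library) and an artificial 9:1 dataset:

```python
# A sketch of SMOTE over-sampling and class weights; assumes the
# imbalanced-learn package (pip install imbalanced-learn) is available.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# An assumed 9:1 imbalanced dataset for illustration.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)
print("Before SMOTE:", Counter(y))

# 3. SMOTE synthesizes new minority-class points by interpolating
#    between existing minority-class neighbours.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("After SMOTE: ", Counter(y_res))

# 4. Alternatively, weight the minority class more heavily in training.
weighted = LogisticRegression(class_weight="balanced", max_iter=1000)
weighted.fit(X, y)
```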

Final Thoughts

It is important to handle over-fitting, under-fitting, and imbalanced datasets. Each can cause big problems in production despite good results during training. We need to understand what kind of data we have and what kind of algorithm we are using in order to apply the proper technique.
