Understanding Underfitting and Overfitting in Machine Learning Models

When training machine learning models, we aim to find the optimal level of model complexity that enables accurate predictions on new, unseen data. Underfitting and overfitting describe two scenarios that can occur when model complexity is not properly tuned.

Why Care About Underfitting and Overfitting?

Underfitting and overfitting relate to a model's ability to generalize. An underfit model fails to capture important patterns in the data, while an overfit model picks up too much noise. Both lead to poor performance on real-world data. By identifying underfitting and overfitting, we can tune model complexity and improve generalization capability.

Comparing Underfitting and Overfitting

Here is a table summarizing the key differences:

                   Underfitting                Overfitting
Training error     High                        Low
Validation error   High                        High
Cause              Model too simple            Model too complex
Solution           Increase model complexity   Reduce model complexity

As we can see, an underfit model exhibits high error on both training and validation data, while an overfit model has low training error but high validation error.

Underfitting occurs when a model is too simplistic to capture the relationships within the data. Linear regression applied to non-linear data often underfits. Overfitting happens when a model picks up spurious patterns that don't generalize to new data. This can occur with highly complex and heavily parameterized models.
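To make this concrete, here is a minimal sketch (assuming Scikit-Learn and a small synthetic dataset invented for illustration) that fits polynomial models of two different degrees to noisy sine data; degree 1 tends to underfit, while a high degree tends to overfit:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative synthetic non-linear data
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(60, 1)), axis=0)
y = np.sin(X).ravel() + 0.3 * rng.randn(60)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

for degree in (1, 15):  # degree 1: too simple; degree 15: too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          model.score(X_train, y_train),   # training R^2 score
          model.score(X_valid, y_valid))   # validation R^2 score

The degree-1 model should score poorly on both splits (underfitting), while the degree-15 model should score well on training data but worse on validation data (overfitting).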

Fixing Underfitting and Overfitting

To fix underfitting, we need to increase model complexity by adding more features, decreasing regularization, or using a more flexible model. Fixing overfitting requires reducing model complexity through constraints such as higher regularization, feature selection, or simpler models.
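In practice, rather than hand-tuning, we can let cross-validation choose the complexity level for us. Here is a minimal sketch, again assuming Scikit-Learn and a small synthetic dataset invented for illustration, that searches over Ridge regularization strengths:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Illustrative synthetic data; substitute your own training set
rng = np.random.RandomState(0)
X_train = rng.randn(80, 20)
y_train = X_train[:, 0] + 0.5 * rng.randn(80)

# Cross-validation scores each alpha on held-out folds: too little
# regularization risks overfitting, too much risks underfitting
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)  # alpha with the best cross-validated R^2 score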

Example in Python with Scikit-Learn

Here is sample Python code using Scikit-Learn to demonstrate how regularization can address overfitting (a small synthetic dataset stands in for real training data):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Illustrative synthetic data; substitute your own dataset
rng = np.random.RandomState(0)
X = rng.randn(100, 20)
y = X[:, 0] + 0.5 * rng.randn(100)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

ridge = Ridge(alpha=1.0)              # Mild L2 regularization
ridge.fit(X_train, y_train)           # Fit the model

print(ridge.score(X_train, y_train))  # Training R^2 score
print(ridge.score(X_valid, y_valid))  # Validation R^2 score

ridge = Ridge(alpha=100.0)            # Stronger L2 regularization
ridge.fit(X_train, y_train)

print(ridge.score(X_train, y_train))
print(ridge.score(X_valid, y_valid))

Increasing the alpha parameter, which controls the strength of the L2 regularization, reduces overfitting: when the original model was overfit, the training R^2 score drops while the validation R^2 score improves. (Note that Ridge.score returns R^2, not classification accuracy.)

Summary

Checking for underfitting and overfitting by comparing training and validation performance is crucial for building models that generalize well to new data. Adjusting model flexibility through regularization, constraints, and feature engineering is key to finding that optimal fit.