
How does adding a regularization term work with overfit data

Hi -
I am a little confused: how does adding a regularization term reduce overfitting? I am trying to understand the mathematical equation. Is there a way someone can explain it with a simple example, maybe a comparison between linear regression and Ridge Regression?

Hello @Ryan_Rodrigues,
Regularization is a technique used to prevent overfitting in machine learning models. It introduces a penalty term to the cost function that the model tries to minimize during training. The penalty term discourages large coefficients in the model, leading to simpler and more generalizable models.

Let’s understand this concept using a comparison between linear regression and ridge regression:

Linear Regression: In linear regression, the goal is to find a set of coefficients (weights) that minimize the sum of squared differences between the predicted values and the actual target values. The cost function for linear regression is given by:

Cost function (J) = Σ(yᵢ - ŷᵢ)²

where yᵢ is the actual target value, ŷᵢ is the predicted value, and the summation is taken over all data points.

In linear regression, there is no penalty for having large coefficients. As a result, the model may fit the training data very well (low training error), but it can perform poorly on unseen data (high test error) due to overfitting.
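To make the cost function concrete, here is a minimal sketch of it in code. The function name `linear_cost` and the arrays are just for illustration; `X` is the feature matrix, `y` the targets, and `beta` the coefficient vector:

```python
import numpy as np

def linear_cost(X, y, beta):
    """Ordinary least-squares cost: sum of squared residuals, no penalty."""
    y_hat = X @ beta               # predicted values ŷᵢ
    residuals = y - y_hat          # yᵢ - ŷᵢ
    return np.sum(residuals ** 2)  # Σ(yᵢ - ŷᵢ)²
```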

Ridge Regression (L2 Regularization): Ridge Regression introduces a regularization term to the cost function, which is proportional to the square of the magnitude of the coefficients. The cost function for Ridge Regression is given by:

Cost function (J) = Σ(yᵢ - ŷᵢ)² + α * Σ(βᵢ²)

where yᵢ is the actual target value, ŷᵢ is the predicted value, α is the regularization parameter (a hyperparameter that controls the strength of regularization), βᵢ is the coefficient corresponding to the ith feature, and the summations are taken over all data points and all features.

By introducing the penalty term α * Σ(βᵢ²), Ridge Regression discourages large values of the coefficients. As α increases, the regularization effect becomes stronger, and the model tends to reduce the magnitude of the coefficients. This leads to a simpler model that is less likely to overfit the training data.
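The same sketch with the L2 penalty added shows how α trades off fit against coefficient size (again, `ridge_cost` is just an illustrative name; in practice the intercept β₀ is usually left out of the penalty, as in the worked example below):

```python
import numpy as np

def ridge_cost(X, y, beta, alpha):
    """Ridge cost: squared residuals plus an L2 penalty on the coefficients."""
    y_hat = X @ beta
    sse = np.sum((y - y_hat) ** 2)       # Σ(yᵢ - ŷᵢ)²
    penalty = alpha * np.sum(beta ** 2)  # α * Σ(βᵢ²)
    return sse + penalty
```

With alpha = 0 this reduces to the ordinary least-squares cost; the larger alpha gets, the more the minimizer is pulled toward small coefficients.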

Example: Let’s consider a simple example with two features (x₁ and x₂) and a target variable (y). We’ll fit both linear regression and Ridge Regression to the same dataset and observe how the coefficients change.

Suppose we have the following dataset:

| x₁ | x₂ | y |
|----|----|---|
| 1  | 2  | 3 |
| 2  | 4  | 6 |
| 3  | 6  | 7 |

Linear Regression: For linear regression, we’ll try to fit a model with the equation y = β₀ + β₁ * x₁ + β₂ * x₂. The coefficients β₀, β₁, and β₂ are learned during training.

The linear regression model may find the following coefficients (for illustration purposes): β₀ = 0.5, β₁ = 1.8, β₂ = 0.7

Ridge Regression: For Ridge Regression, the cost function is modified with the regularization term. We’ll try to fit a model with the equation y = β₀ + β₁ * x₁ + β₂ * x₂, and the regularization term is α * (β₁² + β₂²).

Let’s say we choose α = 0.1 for Ridge Regression. The Ridge Regression model may find the following coefficients (for illustration purposes): β₀ = 0.4, β₁ = 1.2, β₂ = 0.5

Comparison: Notice how the Ridge Regression coefficients are smaller than the linear regression coefficients. The regularization term in Ridge Regression penalizes large coefficient values, which helps to prevent overfitting. The Ridge Regression model is simpler and less prone to overfitting than linear regression, which can have larger coefficients leading to a more complex model.
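If you want to see the shrinking effect yourself, here is a small sketch using scikit-learn on the toy dataset above. The fitted numbers will differ from the illustrative coefficients in the text; α corresponds to Ridge's `alpha` parameter, and scikit-learn's Ridge does not penalize the intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Toy dataset from the example above
X = np.array([[1, 2],
              [2, 4],
              [3, 6]])
y = np.array([3, 6, 7])

lin = LinearRegression().fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("Linear regression: intercept =", lin.intercept_, "coefficients =", lin.coef_)
print("Ridge (alpha=0.1): intercept =", ridge.intercept_, "coefficients =", ridge.coef_)
```

Try increasing `alpha` (e.g. 1.0 or 10.0) and rerunning: the ridge coefficients shrink further toward zero, which is exactly the regularization effect described above.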

In summary, the addition of the regularization term in Ridge Regression helps in reducing overfitting by shrinking the magnitude of the coefficients, leading to a more generalized and robust model.

I hope this helps.
Happy Learning :slight_smile: