Explain L1 and L2 regularisation in Machine Learning

What is regularisation?

When training a machine learning model, there is typically an objective function that needs to be optimised. This objective function is often referred to as the loss function or cost function.

While optimising the cost function, overfitting often occurs, causing a relatively large gap in accuracy between the training and validation sets. There are many approaches to addressing overfitting: cross-validation, obtaining more data (or creating more data), removing features, early stopping, ensembling, regularisation, and so on. This post will look at two variations of regularisation. To be clear, regularisation does not necessarily improve model performance; its main goal is to reduce overfitting, so that training and validation performance track each other more closely.

How does regularisation work?

Intuitively, regularisation constrains the freedom of a model. Take the simplest regression model Y = Wx + b as an example: regularisation adds a penalty term Ω(W) to the objective, so W has less freedom to adjust itself to the target value Y. The objective becomes Loss(Y, Wx + b) + Ω(W).
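As a minimal sketch of this idea (using NumPy; the penalty function `omega`, the strength `lam`, and the synthetic data below are illustrative assumptions, not part of the original post), the regularised objective is simply the original loss plus the penalty term:

```python
import numpy as np

# Ordinary mean squared error for the linear model y_hat = X @ W + b
def mse_loss(W, b, X, y):
    y_hat = X @ W + b
    return np.mean((y - y_hat) ** 2)

# Regularised objective: original loss plus a penalty omega(W) scaled by lam
def regularised_loss(W, b, X, y, lam, omega):
    return mse_loss(W, b, X, y) + lam * omega(W)

# Illustrative data and weights (purely for demonstration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + rng.normal(scale=0.1, size=100)
W, b = rng.normal(size=5), 0.0

# Here omega is the L1 penalty discussed below; lam controls how strongly it is applied
print(regularised_loss(W, b, X, y, lam=0.1, omega=lambda w: np.sum(np.abs(w))))
```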

L1 and L2 regularisation

It turns out there are many possible choices of regularisation function Ω. A widely used family is the p-norm.

The equation is below

\left|\left|x\right|\right|_p=\left(\sum_{i=1}^N\left|x_i\right|^p\right)^{\frac{1}{p}}

When p=1,

\left|\left|x\right|\right|_1 = \left|x_1\right|+\left|x_2\right|+\dots+\left|x_N\right| = \sum_{i=1}^N\left|x_i\right|

We get the L1 norm (also known as L1 regularisation, or LASSO). From the equation, we can see that it sums the absolute values of the model's coefficients, which limits the size of the coefficients.
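As a quick check (with an arbitrary illustrative coefficient vector `w`), this matches NumPy's built-in `np.linalg.norm` with `ord=1`:

```python
import numpy as np

w = np.array([0.5, -2.0, 0.0, 3.0])                  # illustrative coefficient vector
l1 = np.sum(np.abs(w))                               # sum of absolute values
print(l1)                                            # 5.5
print(np.isclose(l1, np.linalg.norm(w, ord=1)))      # True
```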

And when p=2, it becomes

\left|\left|x\right|\right|_2 = \left(\left|x_1\right|^2+\left|x_2\right|^2+\dots+\left|x_N\right|^2\right)^{\frac{1}{2}} = \left(\sum_{i=1}^N\left|x_i\right|^2\right)^{\frac{1}{2}} = \sqrt{\sum_{i=1}^N x_i^2}

We call this the L2 norm (L2 regularisation, Euclidean norm, or Ridge). In practice, Ridge penalises the sum of the squared magnitudes of the coefficients (the squared L2 norm). It will not yield sparse outputs; instead, all coefficients are shrunk by a similar factor rather than being driven to zero.
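Again, a small NumPy sketch (with the same illustrative vector `w`) verifies the L2 norm and shows the squared version that Ridge actually penalises:

```python
import numpy as np

w = np.array([0.5, -2.0, 0.0, 3.0])                  # illustrative coefficient vector
l2 = np.sqrt(np.sum(w ** 2))                         # Euclidean (L2) norm
print(np.isclose(l2, np.linalg.norm(w, ord=2)))      # True
ridge_penalty = np.sum(w ** 2)                       # Ridge penalises the squared L2 norm
print(ridge_penalty)                                 # 13.25
```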

The main difference between the two techniques is that L1 shrinks the coefficients of less important features all the way down to zero. Because this implicitly removes those features, L1 can be used for feature selection. It works well when the dataset is very wide (many features), since the resulting coefficients are sparse. L2 (Ridge) regularisation, on the other hand, tends to keep the coefficients of less important features small rather than zeroing them out.
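To illustrate this difference, here is a small sketch using scikit-learn's Lasso and Ridge estimators; the synthetic data and the alpha values are arbitrary choices for demonstration only:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only 2 of the 10 features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso:", np.round(lasso.coef_, 3))  # most coefficients come out exactly 0
print("Ridge:", np.round(ridge.coef_, 3))  # coefficients are small but non-zero
```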

According to a paper by Andrew Ng [1], if you expect the dataset to contain a large number of irrelevant features, L1 requires fewer training samples than L2 to generalise well. Otherwise, L2 tends to do better.

References

[1] Andrew Y. Ng, "Feature selection, L1 vs. L2 regularization, and rotational invariance", ICML 2004. http://ai.stanford.edu/~ang/papers/icml04-l1l2.pdf

[2] https://blog.alexlenail.me/what-is-the-difference-between-ridge-regression-the-lasso-and-elasticnet-ec19c71c9028