Regularization is a common way to deal with overfitting in machine learning (ML). The simplest and most widely used variants are L1 (lasso) and L2 (ridge) regularization. Both are well covered in numerous tutorials and books. However, I could not find a good geometric or intuitive explanation of why L1 encourages coefficients to shrink exactly to zero. This post tries to fill that gap.
Lasso regression is a linear regression model that shrinks the coefficients by constraining their magnitude. Namely, it bounds the sum of the absolute values of the coefficients:
\[ \begin{align} \hat{\beta}^{\text{lasso}} = \underset{\beta}{\operatorname{argmin}} & \sum_{i=1}^N \left( y_i-\beta_0-\sum_{j=1}^p{x_{ij}\beta_j} \right)^2\\ \text{subject to } & \sum_{j=1}^p|\beta_j| \le t \end{align} \]
The above equation is the same as equation (3.51) in “The Elements of Statistical Learning” (ESL) by Hastie, Tibshirani, and Friedman.
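To get a feel for what the constraint does before looking at any geometry, here is a minimal sketch in plain Python. It rests on a strong simplifying assumption that is mine, not ESL's setup in general: the predictors are orthonormal and the intercept is ignored. In that special case the constrained problem above reduces to finding the point closest to the ordinary least squares coefficients whose absolute values sum to at most t, which can be done by soft-thresholding every coefficient with a common threshold. The names `soft_threshold`, `lasso_orthonormal`, and `beta_ols`, as well as the numbers, are made up for illustration.

```python
def soft_threshold(b, lam):
    """Move b toward zero by lam; anything within lam of zero becomes exactly 0."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


def lasso_orthonormal(beta_ols, t, tol=1e-12):
    """Constrained lasso solution, ASSUMING orthonormal predictors and no intercept.

    beta_ols: ordinary least squares coefficients (hypothetical input).
    t: the budget on the sum of absolute values of the coefficients.
    """
    # If the least squares solution already satisfies the constraint, keep it.
    if sum(abs(b) for b in beta_ols) <= t:
        return list(beta_ols)

    # Otherwise bisect on the threshold lam until the L1 norm matches t.
    lo, hi = 0.0, max(abs(b) for b in beta_ols)
    while hi - lo > tol:
        lam = (lo + hi) / 2
        l1_norm = sum(abs(soft_threshold(b, lam)) for b in beta_ols)
        if l1_norm > t:
            lo = lam  # still over budget: shrink more
        else:
            hi = lam  # within budget: try shrinking less
    return [soft_threshold(b, hi) for b in beta_ols]


beta_ols = [3.0, -1.0, 0.5]  # illustrative least squares coefficients
beta_lasso = lasso_orthonormal(beta_ols, t=2.5)
print(beta_lasso)
```

With these made-up numbers the result is approximately 2.25, -0.25, and 0: the two larger coefficients are shrunk by the same amount, while the smallest one lands exactly on zero rather than merely getting close to it. That exact zero is the behavior the rest of this post tries to explain.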