Brian Zhang's blog

Statistics and other topics

A Regularization Proof

Posted Sep 19, 2022 · 3 min read

Say we have a loss function \(l(w)\). With no regularization, we might obtain the minimizer \(w_0\). Now consider the setting with regularization: \[ f_\lambda(w) = l(w) + \lambda R(w), \] where \(R(w) \geq 0\) is some regularization function and \(\lambda \geq 0\). What can we say if we compare the minimizer \(w_1\) of \(f_{\lambda_1}(w)\) with the minimizer \(w_2\) of \(f_{\lambda_2}(w)\), where \(0 \leq \lambda_1 < \lambda_2\)? \[ w_1 = \operatorname*{argmin}_w \left[ l(w) + \lambda_1 R(w) \right],\\ w_2 = \operatorname*{argmin}_w \left[ l(w) + \lambda_2 R(w) \right]. \]
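To make this concrete, here's a minimal numerical sketch (my own illustration, not part of the original argument) using ridge regression as the instance: \(l(w) = \lVert Xw - y\rVert^2\) and \(R(w) = \lVert w\rVert^2\), for which the regularized minimizer has a closed form. The data and the helper `ridge_minimizer` are hypothetical names chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))   # synthetic design matrix
y = rng.normal(size=50)        # synthetic targets

def ridge_minimizer(lam):
    """Closed-form argmin of ||Xw - y||^2 + lam * ||w||^2,
    i.e. the solution of (X^T X + lam I) w = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lam1, lam2 = 0.1, 10.0         # 0 <= lam1 < lam2
w1 = ridge_minimizer(lam1)
w2 = ridge_minimizer(lam2)
```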

Intuitively, as we increase \(\lambda\) from \(\lambda_1\) to \(\lambda_2\), the function \(f_\lambda(w)\) places more importance on the regularization term \(R(w)\). We should expect \(l(w)\) evaluated at the optimum \(w\) to increase, and the regularization term \(R(w)\) evaluated at the optimum \(w\) to decrease.

By the properties of the optimum, we have \[\begin{gather} l(w_1) + \lambda_1 R(w_1) \leq l(w_2) + \lambda_1 R(w_2), \quad (1)\\ l(w_2) + \lambda_2 R(w_2) \leq l(w_1) + \lambda_2 R(w_1). \quad (2) \end{gather}\] The only other information we have relating these terms is that \(R(w) \geq 0\) (for all \(w\)) and \(0 \leq \lambda_1 < \lambda_2\). So we work with what we have. First, leveraging \((1)\), \[\begin{align*} f_{\lambda_1}(w_1) &= l(w_1) + \lambda_1 R(w_1)\\ &\leq l(w_2) + \lambda_1 R(w_2)\\ &\leq l(w_2) + \lambda_2 R(w_2)\\ &= f_{\lambda_2}(w_2), \end{align*}\] where the second inequality uses \(\lambda_1 \leq \lambda_2\) and \(R(w_2) \geq 0\). So the minimum of the optimized function increases (or stays the same) as we increase \(\lambda\). This also follows from the fact that \(f_{\lambda_2}(w) \geq f_{\lambda_1}(w)\) for all \(w\), so in particular the minimum values are ordered.
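Continuing the ridge sketch above (the `loss` and `reg` helpers below are again my own illustrative names), we can check this first claim numerically:

```python
def loss(w):
    return float(np.sum((X @ w - y) ** 2))   # l(w)

def reg(w):
    return float(np.sum(w ** 2))             # R(w)

# f_{lam1}(w1) <= f_{lam2}(w2): the optimal objective value
# weakly increases as lambda increases
assert loss(w1) + lam1 * reg(w1) <= loss(w2) + lam2 * reg(w2)
```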

The other inequalities are trickier. Observe (starting with \((2)\)): \[\begin{align*} l(w_1) + \lambda_2 R(w_1) &\geq l(w_2) + \lambda_2 R(w_2)\\ &= l(w_2) + (\lambda_1 + \lambda_2 - \lambda_1) R(w_2)\\ &= \left[l(w_2) + \lambda_1 R(w_2)\right] + (\lambda_2 - \lambda_1) R(w_2)\\ &\geq \left[l(w_1) + \lambda_1 R(w_1)\right] + (\lambda_2 - \lambda_1) R(w_2), \end{align*}\] where the last step applies \((1)\). Subtracting \(l(w_1) + \lambda_1 R(w_1)\) from both sides, we have \[ (\lambda_2 - \lambda_1) R(w_1) \geq (\lambda_2 - \lambda_1) R(w_2). \] Since \(\lambda_2 - \lambda_1 > 0\), dividing both sides gives \[ R(w_1) \geq R(w_2). \] In words, the minimum of the regularization component (not including the factor of \(\lambda\)) decreases (or stays the same) as we increase \(\lambda\).¹
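Again continuing the sketch, this second claim is a one-line check:

```python
# R(w1) >= R(w2): the regularizer at the optimum weakly decreases with lambda
assert reg(w1) >= reg(w2)
```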

Starting with \((1)\) and leveraging this fact, we additionally have \[\begin{align*} l(w_1) + \lambda_1 R(w_1) &\leq l(w_2) + \lambda_1 R(w_2)\\ &\leq l(w_2) + \lambda_1 R(w_1), \end{align*}\] where the second inequality uses \(R(w_2) \leq R(w_1)\) and \(\lambda_1 \geq 0\). Subtracting \(\lambda_1 R(w_1)\) from both sides, we obtain \[ l(w_1) \leq l(w_2). \] In words, the minimum of the loss function component increases (or stays the same) as we increase \(\lambda\).²
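And the third claim, completing the numerical sketch:

```python
# l(w1) <= l(w2): the loss at the optimum weakly increases with lambda
assert loss(w1) <= loss(w2)
```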


  1. An alternate proof, adding \((1)\) and \((2)\): \[ l(w_1) + l(w_2) + \lambda_1 R(w_1) + \lambda_2 R(w_2) \leq l(w_1) + l(w_2) + \lambda_1 R(w_2) + \lambda_2 R(w_1),\\ \lambda_1 R(w_1) + \lambda_2 R(w_2) \leq \lambda_1 R(w_2) + \lambda_2 R(w_1),\\ (\lambda_2 - \lambda_1) R(w_2) \leq (\lambda_2 - \lambda_1) R(w_1),\\ R(w_2) \leq R(w_1), \] where the last step divides by \(\lambda_2 - \lambda_1 > 0\).

  2. An alternate proof (assuming \(\lambda_1 > 0\)), adding \(1/\lambda_1\) times \((1)\) to \(1/\lambda_2\) times \((2)\): \[ \frac{l(w_1)}{\lambda_1} + \frac{l(w_2)}{\lambda_2} + R(w_1) + R(w_2) \leq \frac{l(w_2)}{\lambda_1} + \frac{l(w_1)}{\lambda_2} + R(w_2) + R(w_1),\\ \frac{l(w_1)}{\lambda_1} + \frac{l(w_2)}{\lambda_2} \leq \frac{l(w_2)}{\lambda_1} + \frac{l(w_1)}{\lambda_2},\\ \left(\frac{1}{\lambda_1} - \frac{1}{\lambda_2}\right) l(w_1) \leq \left(\frac{1}{\lambda_1} - \frac{1}{\lambda_2}\right) l(w_2),\\ l(w_1) \leq l(w_2), \] where the last step divides by \(\frac{1}{\lambda_1} - \frac{1}{\lambda_2} > 0\).
