Newton's Method 📂Optimization

Definition¹ ²

In the problem of optimizing the objective function $J : \mathbb{R}^{n} \to \mathbb{R}$, the following iterative algorithm is called Newton’s method.

$$ \begin{equation} \mathbf{x}_{n+1} = \mathbf{x}_{n} - H^{-1}(\mathbf{x}_{n}) \nabla J(\mathbf{x}_{n}) \end{equation} $$

Here, $\nabla J$ is the gradient of $J$ and $H$ is its Hessian.
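
Written as code, the update $(1)$ can be transcribed almost literally. The following is a minimal sketch rather than a production implementation, assuming the gradient and Hessian of $J$ are available as callables `grad` and `hess` (names chosen here purely for illustration):

```python
import numpy as np

def newtons_method(grad, hess, x0, n_iter=20, tol=1e-10):
    """Literal transcription of update (1): x_{n+1} = x_n - H(x_n)^{-1} grad J(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        step = np.linalg.inv(hess(x)) @ grad(x)  # H^{-1} ∇J, exactly as written in (1)
        x = x - step
        if np.linalg.norm(step) < tol:           # stop once the update is negligible
            break
    return x
```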

Derivation

Approximating $J$ with its Taylor expansion up to the second-order term gives:

$$ J(\mathbf{x}) \approx J(\mathbf{x}_{0}) + (\mathbf{x} - \mathbf{x}_{0})^{T} \nabla J (\mathbf{x}_{0}) + \dfrac{1}{2!}(\mathbf{x} - \mathbf{x}_{0})^{T} (H(\mathbf{x}_{0})) (\mathbf{x} - \mathbf{x}_{0}) $$

To minimize $J$, differentiate with respect to $\mathbf{x}$ and set the result to zero (the arguments of the gradient and Hessian are omitted):

$$ \dfrac{\partial J(\mathbf{x})}{\partial \mathbf{x}} = \nabla J + H(\mathbf{x} - \mathbf{x}_{0}) = \mathbf{0} $$ $$ \begin{equation} \implies \mathbf{x}^{\ast} = \mathbf{x}_{0} - H^{-1}\nabla J \end{equation} $$
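
For reference, the two non-constant terms of the quadratic approximation differentiate as follows, using the symmetry of the Hessian:

$$ \dfrac{\partial}{\partial \mathbf{x}} (\mathbf{x} - \mathbf{x}_{0})^{T} \nabla J(\mathbf{x}_{0}) = \nabla J(\mathbf{x}_{0}), \qquad \dfrac{\partial}{\partial \mathbf{x}} \dfrac{1}{2}(\mathbf{x} - \mathbf{x}_{0})^{T} H(\mathbf{x}_{0}) (\mathbf{x} - \mathbf{x}_{0}) = H(\mathbf{x}_{0})(\mathbf{x} - \mathbf{x}_{0}) $$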

Explanation

Whereas standard gradient descent uses only the first derivative of the objective function, Newton's method also uses the second derivative. Comparing the two update rules, the learning rate $\alpha$ is replaced by the inverse of the Hessian, which can be viewed as an adaptive learning rate.

$$ \begin{align*} \text{Gradient Descent: } \mathbf{x}_{n+1} &= \mathbf{x}_{n} - \alpha \nabla J \\ \text{Newton’s Method: } \mathbf{x}_{n+1} &= \mathbf{x}_{n} - H^{-1} \nabla J \end{align*} $$

Note that the Taylor expansion is used only up to the second-order term in the derivation. Consequently, if $J$ is (locally) a quadratic function, then $(2)$ finds the (locally) optimal solution $\mathbf{x}^{\ast}$ in a single step. Of course, such a favorable situation is rare in general, so $(1)$ must be iterated to find $\mathbf{x}^{\ast}$.
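
As a quick check of this claim, the snippet below (with an arbitrarily chosen positive definite $A$ and vector $\mathbf{b}$, purely for illustration) minimizes the quadratic $J(\mathbf{x}) = \frac{1}{2}\mathbf{x}^{T} A \mathbf{x} - \mathbf{b}^{T}\mathbf{x}$, whose gradient is $A\mathbf{x} - \mathbf{b}$ and whose Hessian is $A$. A single step of $(1)$ from any starting point lands exactly on the minimizer $A^{-1}\mathbf{b}$:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])        # symmetric positive definite Hessian
b = np.array([1.0, -1.0])

grad = lambda x: A @ x - b         # gradient of J(x) = 1/2 x^T A x - b^T x
x0 = np.array([10.0, -7.0])        # arbitrary starting point

x_newton = x0 - np.linalg.inv(A) @ grad(x0)   # one step of update (1)
x_star = np.linalg.solve(A, b)                # exact minimizer A^{-1} b

print(np.allclose(x_newton, x_star))          # True: one step suffices
```

Starting from the same point, gradient descent with a fixed learning rate would generally need many iterations to reach $\mathbf{x}^{\ast}$.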

Considering that the method requires computing an inverse matrix, one might guess that it is not very practical. Storing the Hessian itself consumes a lot of memory, and inverting an $n \times n$ matrix costs $\mathcal{O}(n^{3})$, so when $n$ is large it may well be more efficient to simply run a few more steps of gradient descent than to spend that time on the inversion. This is especially true in deep learning, where neural networks need a vast number of parameters to perform well; considering that GPT-3.5, the model behind ChatGPT, has on the order of 175 billion parameters, it would be hard to expect to train such a model with Newton's method. Several improved algorithms have been designed to overcome the high cost of computing the Hessian.
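
That said, even when the Newton step is actually computed, the inverse matrix itself is usually not formed explicitly; the standard approach is to solve the linear system $H\mathbf{p} = \nabla J$ for the step $\mathbf{p}$ instead. The sketch below shows one way to write this, under the same assumptions as the earlier sketch (`grad` and `hess` are callables):

```python
import numpy as np

def newton_step(x, grad, hess):
    """One Newton update that solves H p = grad J(x) rather than inverting H.

    Solving the system is still O(n^3) in general, but it is cheaper and more
    numerically stable than explicitly computing the inverse matrix.
    """
    p = np.linalg.solve(hess(x), grad(x))
    return x - p
```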

In the single-variable case, the update written below makes it clear that this is simply Newton's method for root finding applied to $J^{\prime}$.

$$ x_{n+1} = x_{n} - \dfrac{J^{\prime}(x_{n})}{J^{\prime \prime}(x_{n})} $$
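
For instance (an arbitrary example chosen for illustration), take $J(x) = e^{x} - 2x$, so that $J^{\prime}(x) = e^{x} - 2$, $J^{\prime\prime}(x) = e^{x}$, and the minimum lies at $x^{\ast} = \ln 2$. A few iterations of the update above recover it:

```python
import math

# J(x) = e^x - 2x  =>  J'(x) = e^x - 2,  J''(x) = e^x,  minimizer x* = ln 2
dJ = lambda x: math.exp(x) - 2.0
d2J = lambda x: math.exp(x)

x = 0.0                          # starting point
for _ in range(6):
    x = x - dJ(x) / d2J(x)       # x_{n+1} = x_n - J'(x_n) / J''(x_n)

print(x, math.log(2))            # both approximately 0.6931
```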


  1. Ian Goodfellow, Deep Learning, 8.6.1 Newton’s Method ↩︎

  2. 오일석, Machine Learning (2017), Ch 5.6.1 Newton's Method ↩︎