Newton's Algorithm¶

Feng Li

School of Statistics and Mathematics

Central University of Finance and Economics

feng.li@cufe.edu.cn

https://feng.li/statcomp

Finding a maximum/minimum¶

  • Suppose we want to find a minimum or maximum of a function $f(x)$

  • First order condition: Find the derivative $f'(x)$ and find $x^*$ such that $f'(x^*)=0$

  • This is the same as finding a root of the first derivative. We can use the Newton-Raphson algorithm on the first derivative.

Newton’s algorithm for finding local minima/maxima¶

  1. Select initial value $x_0$ and set $n=0$

  2. Set $x_{n+1}=x_{n}-\frac{f'(x_n)}{f''(x_n)}$

  3. Evaluate $|f'(x_{n+1})|$

    • If $|f'(x_{n+1})|<\epsilon$ then stop.

    • Otherwise set $n=n+1$ and go back to step 2.

Different Stopping Rules¶

Three stopping rules can be used:

  • $| f'(x_{n}) | \leq \epsilon$
  • $| x_{n}-x_{n-1} | \leq \epsilon$
  • $| f(x_{n})-f(x_{n-1}) | \leq \epsilon$
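A minimal R sketch of the univariate algorithm using the first stopping rule (the function name newton1d and its arguments are only illustrative):

# Univariate Newton's method: supply the first and second derivatives
newton1d <- function(fp, fpp, x0, eps = 1e-8, max_iter = 100) {
  x <- x0
  for (n in 1:max_iter) {
    x <- x - fp(x) / fpp(x)      # update step: x_{n+1} = x_n - f'(x_n)/f''(x_n)
    if (abs(fp(x)) < eps) break  # stopping rule: |f'(x_{n+1})| < epsilon
  }
  x
}

# Example: f(x) = (x - 2)^2 has f'(x) = 2(x - 2), f''(x) = 2, minimum at x = 2
newton1d(fp = function(x) 2 * (x - 2), fpp = function(x) 2, x0 = 10)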

Intuition¶

  • Focus on the step size $-\frac{f'(x)}{f''(x)}$.

  • The signs of the derivatives control the direction of the next step.

  • The sizes of the derivatives control the size of the next step.

  • Consider the concave function $f(x)=-x^4$ which has $f'(x)=-4x^3$ and $f''(x)=-12x^2$. There is a maximum at $x^{*}=0$
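For this example the update step can be written out explicitly: $$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)} = x_n - \frac{-4x_n^3}{-12x_n^2} = \frac{2}{3}\,x_n,$$ so the iterates shrink geometrically towards $x^{*}=0$. They never land on it exactly (note that $f''(0)=0$), but since $|f'(x_n)|=4|x_n|^3 \to 0$ any of the stopping rules above is eventually satisfied.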

Role of first derivative¶

  • If $f''(x)$ is negative the function is locally concave, and the search is for a local maximum

  • To the left of this maximum $f'(x)>0$

  • Therefore $-\frac{f'(x)}{f''(x)}>0$.

  • The next step is to the right.

  • The reverse holds if $f'(x)<0$

  • Large absolute values of $f'(x)$ imply a steep slope. A big step is needed to get close to the optimum. The reverse holds for small absolute values of $f'(x)$.

  • If $f''(x)$ is positive the function is locally convex, and the search is for a local minimum

  • To the left of this minimum $f'(x)<0$

  • Therefore $-\frac{f'(x)}{f''(x)}>0$.

  • The next step is to the right.

  • The reverse holds if $f'(x)>0$

  • Large absolute values of $f'(x)$ imply a steep slope. A big step is needed to get close to the optimum. The reverse holds for small absolute values of $f'(x)$.
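A quick numerical illustration of the convex case: take $f(x)=x^2-4x$, which has $f'(x)=2x-4$, $f''(x)=2$ and a minimum at $x^{*}=2$. Starting at $x_0=0$, we have $f'(x_0)=-4<0$ and $f''(x_0)=2>0$, so the step is $-\frac{f'(x_0)}{f''(x_0)}=-\frac{-4}{2}=2>0$: the algorithm moves to the right and, because the function is quadratic, lands exactly on $x^{*}=2$ in a single step.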

Role of second derivative¶

  • Together with the sign of the first derivative, the sign of the second derivative controls the direction of the next step.

  • A larger second derivative (in absolute value) implies more curvature

  • In this case smaller steps are needed to stop the algorithm from overshooting.

  • The opposite holds for a small second derivative.

Multidimensional Optimization¶

Functions with more than one input

  • Most interesting optimization problems involve multiple inputs.

    • In determining the most risk-efficient portfolio, the return is a function of many weights (one for each asset).

    • In least squares estimation for a linear regression model, the sum of squares is a function of many coefficients (one for each regressor).

  • How do we optimize for functions $f({\mathbf{ x}})$ where ${\mathbf{ x}}$ is a vector?

  • First derivative. Simply take the partial derivatives and put them in a vector $$\frac{\partial y}{\partial{\mathbf x}}= \left(\begin{array}{c} \frac{\partial y}{\partial x_1}\\ \frac{\partial y}{\partial x_2}\\ \vdots\\ \frac{\partial y}{\partial x_d} \end{array}\right)$$ This is called the gradient vector.
  • An example. The function $$y=x_1^2-x_1x_2+x_2^2+e^{x_2}$$ has gradient vector $$\frac{\partial y}{\partial{\mathbf x}}= \left( \begin{array}{c} 2x_1-x_2\\ -x_1+2x_2+e^{x_2} \end{array} \right)$$

  • Second derivative. Simply take the second order partial derivatives. This will give a matrix $$\frac{\partial^2 y}{\partial{\mathbf x}\partial{\mathbf x}'}= \left(\begin{array}{cccc} \frac{\partial^2 y}{\partial x_1^2}&\frac{\partial^2 y}{\partial x_1\partial x_2}&\cdots&\frac{\partial^2 y}{\partial x_1\partial x_d}\\ \frac{\partial^2 y}{\partial x_2\partial x_1}&\frac{\partial^2 y}{\partial x_2^2}&\cdots&\frac{\partial^2 y}{\partial x_2\partial x_d}\\ \vdots&\vdots&\ddots&\vdots\\ \frac{\partial^2 y}{\partial x_d\partial x_1}&\frac{\partial^2 y}{\partial x_d\partial x_2}&\cdots&\frac{\partial^2 y}{\partial x_d^2} \end{array}\right)$$ This is called the Hessian matrix.
  • An example. The function $$y=x_1^2-x_1x_2+x_2^2+e^{x_2}$$ has Hessian matrix $$\frac{\partial^2 y}{\partial{\mathbf x}\partial{\mathbf x}'}= \left(\begin{array}{cc} 2 & -1\\ -1 & 2 + e^{x_2} \end{array}\right)$$
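These derivatives are straightforward to code up and check. A minimal R sketch (the names grad_f and hess_f are only illustrative), evaluated at the point $(x_1,x_2)=(1,0)$:

# Analytic gradient and Hessian of y = x1^2 - x1*x2 + x2^2 + exp(x2)
grad_f <- function(x) c(2 * x[1] - x[2],
                        -x[1] + 2 * x[2] + exp(x[2]))
hess_f <- function(x) matrix(c(2,            -1,
                               -1, 2 + exp(x[2])),
                             nrow = 2, byrow = TRUE)

grad_f(c(1, 0))  # returns c(2, 0)
hess_f(c(1, 0))  # returns the 2x2 matrix with rows (2, -1) and (-1, 3)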

Preliminaries for matrix derivatives¶

  1. The derivative of a vector $\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}$ by a scalar $x$ is written (in numerator layout notation) as $$\frac{\partial \mathbf{y}}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x}\\ \frac{\partial y_2}{\partial x}\\ \vdots\\ \frac{\partial y_m}{\partial x} \end{bmatrix}.$$ In vector calculus, the derivative of a vector $\mathbf{y}$ with respect to a scalar $x$ is known as the tangent vector, $\frac{\partial \mathbf{y}}{\partial x}$.
  2. The derivative of a scalar $y$ by a vector $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$ is written (in numerator layout notation) as $$\frac{\partial y}{\partial \mathbf{x}} = \left[ \frac{\partial y}{\partial x_1} \ \ \frac{\partial y}{\partial x_2} \ \ \cdots \ \ \frac{\partial y}{\partial x_n} \right].$$
  3. The second order derivatives of a scalar $y$ by a vector $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$ are written (in numerator layout notation) as $$\frac{\partial^2 y}{\partial \mathbf{x}\partial \mathbf{x}'}=\frac{\partial}{\partial \mathbf{x}'}\left[\frac{\partial y}{\partial \mathbf{x}}\right]=\frac{\partial}{\partial \mathbf{x}'}\left[ \frac{\partial y}{\partial x_1} \ \ \frac{\partial y}{\partial x_2} \ \ \cdots \ \ \frac{\partial y}{\partial x_n} \right] = \begin{bmatrix} \frac{\partial^2 y}{\partial x_1^2} & \frac{\partial^2 y}{\partial x_1\partial x_2} & \cdots & \frac{\partial^2 y}{\partial x_1\partial x_n}\\ \frac{\partial^2 y}{\partial x_2\partial x_1} & \frac{\partial^2 y}{\partial x_2^2} & \cdots & \frac{\partial^2 y}{\partial x_2\partial x_n}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial^2 y}{\partial x_n\partial x_1} & \frac{\partial^2 y}{\partial x_n\partial x_2} & \cdots & \frac{\partial^2 y}{\partial x_n^2} \end{bmatrix}.$$
  4. The derivative of a vector function (a vector whose components are functions) $\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}$ with respect to an input vector $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$ is written (in numerator layout notation) as $$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n}\\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}.$$
  5. The derivative of a matrix function $\mathbf{Y}$ by a scalar $x$ is known as the tangent matrix and is given (in numerator layout notation) by $$\frac{\partial \mathbf{Y}}{\partial x} = \begin{bmatrix} \frac{\partial y_{11}}{\partial x} & \frac{\partial y_{12}}{\partial x} & \cdots & \frac{\partial y_{1n}}{\partial x}\\ \frac{\partial y_{21}}{\partial x} & \frac{\partial y_{22}}{\partial x} & \cdots & \frac{\partial y_{2n}}{\partial x}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial y_{m1}}{\partial x} & \frac{\partial y_{m2}}{\partial x} & \cdots & \frac{\partial y_{mn}}{\partial x} \end{bmatrix}.$$
  6. The derivative of a scalar $y$, a function of a matrix $\mathbf{X}$ of independent variables, with respect to the matrix $\mathbf{X}$ is given (in numerator layout notation) by $$\frac{\partial y}{\partial \mathbf{X}} = \begin{bmatrix} \frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{21}} & \cdots & \frac{\partial y}{\partial x_{p1}}\\ \frac{\partial y}{\partial x_{12}} & \frac{\partial y}{\partial x_{22}} & \cdots & \frac{\partial y}{\partial x_{p2}}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial y}{\partial x_{1q}} & \frac{\partial y}{\partial x_{2q}} & \cdots & \frac{\partial y}{\partial x_{pq}} \end{bmatrix}.$$
  7. Many rules and tricks for matrix derivatives are available from Helmut Lütkepohl (1996), Handbook of Matrices, Chapter 10.
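Two standard rules of this kind (stated in numerator layout), which are useful for the regression derivations below, are $$\frac{\partial (\mathbf{a}'\mathbf{x})}{\partial \mathbf{x}} = \mathbf{a}', \qquad \frac{\partial (\mathbf{x}'\mathbf{A}\mathbf{x})}{\partial \mathbf{x}} = \mathbf{x}'(\mathbf{A} + \mathbf{A}'),$$ for a constant vector $\mathbf{a}$ and a constant matrix $\mathbf{A}$.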

Newton’s algorithm for multidimensional optimization¶

We can now generalise the update step in Newton’s method: $${\mathbf x}_{n+1}={\mathbf x}_n-\left(\frac{\partial^2 f({\mathbf x}_n)}{\partial {\mathbf x}\partial{\mathbf x}'}\right)^{-1} \frac{\partial f({\mathbf x}_n)}{\partial {\mathbf x}}$$ where both the Hessian and the gradient are evaluated at the current iterate ${\mathbf x}_n$.

Lab¶

Write code to minimise $y=x_1^2-x_1x_2+x_2^2+e^{x_2}$
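One possible sketch in R, using the analytic gradient and Hessian derived above (this only illustrates the update step and is not necessarily the intended lab solution):

# Newton's method for f(x) = x1^2 - x1*x2 + x2^2 + exp(x2)
grad_f <- function(x) c(2 * x[1] - x[2], -x[1] + 2 * x[2] + exp(x[2]))
hess_f <- function(x) matrix(c(2, -1, -1, 2 + exp(x[2])), nrow = 2)

x <- c(1, 1)                             # starting value x_0
for (n in 1:100) {
  x <- x - solve(hess_f(x), grad_f(x))   # x_{n+1} = x_n - H^{-1} gradient
  if (sum(abs(grad_f(x))) < 1e-8) break  # stop when the gradient is numerically zero
}
x                                        # approximate minimiser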

The linear regression model, a revisit¶

  • Consider the linear regression model with multiple covariates,

    $$y_i = \beta_0 + \beta_1x_{1i}+...+\beta_p x_{pi} + \epsilon_i$$ where $\epsilon_i \sim N(0, \sigma^2)$

  • What is the (log) likelihood function?

  • What are the unknown parameters? Let’s consider three situations

    • When $\beta_0=1$ and $\sigma^2=1$ are known.

    • When $\sigma^2=1$ is known.

    • When neither the $\beta$'s nor $\sigma^2$ is known.
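For reference, under the normality assumption the log-likelihood for $n$ observations is $$\log \mathcal{L}(\beta_0,\ldots,\beta_p,\sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i - \beta_0 - \beta_1 x_{1i} - \cdots - \beta_p x_{pi}\right)^2,$$ and the three situations above differ only in which of these parameters are treated as unknown.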

Maximum likelihood estimate for linear models¶

  • What are the gradient and Hessian matrix for the log likelihood ($\mathcal{L}$) with respect to the parameter vector $\boldsymbol{\beta}=(\beta_0,...,\beta_p)'$?

    $$\frac{\partial \log \mathcal{L}}{\partial \boldsymbol{\beta}} = ?$$

    $$\frac{\partial^2 \log \mathcal{L}}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}'} = ?$$
  • Write down the likelihood function with respect to the unknown parameters.

  • Write down the gradient for the likelihood function.

  • Write down the Hessian for the likelihood function.

  • Use Newton’s method to obtain the best parameter estimate.
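For reference, writing $\mathbf{X}$ for the $n\times(p+1)$ design matrix (including a column of ones) and $\mathbf{y}$ for the response vector, standard calculations give $$\frac{\partial \log \mathcal{L}}{\partial \boldsymbol{\beta}} = \frac{1}{\sigma^2}\mathbf{X}'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta}), \qquad \frac{\partial^2 \log \mathcal{L}}{\partial \boldsymbol{\beta}\,\partial \boldsymbol{\beta}'} = -\frac{1}{\sigma^2}\mathbf{X}'\mathbf{X}.$$ Because the log-likelihood is quadratic in $\boldsymbol{\beta}$, Newton's method reaches $\hat{\boldsymbol{\beta}}=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ in a single step from any starting value.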

Vector Newton-Raphson Algorithm: The logit model¶

  • The idea: use the maximum likelihood method with the binomial (Bernoulli) distribution.

  • Owning a house ($Y=1$) or not owning a house ($Y=0$) can be represented with a Bernoulli distribution $$Pr(y;p) = p^y (1-p)^{1-y}\quad \text{for }y\in\{0,1\}.$$

  • The log likelihood function is as follows $$l(\beta)= \sum_{i=1}^N\left\{ y_i \log P_i + (1- y_i)\log (1-P_i)\right\}$$ where

    $$P_i=\frac{1}{1+\exp(-(\beta_1+\beta_2X_{2i}+...+\beta_pX_{pi}))}$$

  • Note that the sum of $N$ independent Bernoulli samples follows a binomial distribution.

  • To obtain $\hat \beta$, use the Newton-Raphson algorithm $$\beta^{new} = \beta^{old} - \left(\frac{\partial^2l(\beta)}{\partial \beta\partial \beta'}\right)^{-1} \left.\frac{\partial l(\beta)}{\partial\beta}\right|_{\beta= \beta^{old}}$$
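For reference, with $\mathbf{x}_i=(1, X_{2i},\ldots,X_{pi})'$ the gradient and Hessian of $l(\beta)$ take the well-known forms $$\frac{\partial l(\beta)}{\partial \beta} = \sum_{i=1}^N \mathbf{x}_i\,(y_i - P_i) = \mathbf{X}'(\mathbf{y}-\mathbf{P}), \qquad \frac{\partial^2 l(\beta)}{\partial \beta\,\partial \beta'} = -\sum_{i=1}^N P_i(1-P_i)\,\mathbf{x}_i\mathbf{x}_i' = -\mathbf{X}'\mathbf{W}\mathbf{X},$$ where $\mathbf{W}=\mathrm{diag}\{P_i(1-P_i)\}$; these are exactly the pieces needed in the update above.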

Let's code them up¶

  • Use Newton’s method to find the maximum likelihood estimators for the coefficients in a logistic regression. The steps are:

    • Write down the likelihood function

    • Find the gradient and Hessian matrix

    • Code these up in R/Python

    • Use the following simulated data to test your code (a hand-rolled Newton-Raphson sketch for comparison is given after the glm() output below)

In [7]:
# Create some example data
set.seed(123)
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(1 + 2*x1 + 3*x2))

# Fit a logistic regression model
model <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
summary(model)
Call:
glm(formula = y ~ x1 + x2, family = binomial(link = "logit"))

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-3.00404  -0.25042   0.06014   0.34895   2.97130  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.1877     0.1291   9.199   <2e-16 ***
x1            2.1243     0.1768  12.018   <2e-16 ***
x2            3.4635     0.2395  14.460   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1329.1  on 999  degrees of freedom
Residual deviance:  515.4  on 997  degrees of freedom
AIC: 521.4

Number of Fisher Scoring iterations: 7
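As a cross-check for your own implementation, here is a minimal hand-rolled Newton-Raphson sketch that reuses the simulated x1, x2 and y from the cell above; it should give coefficients very close to the glm() fit.

# Newton-Raphson for the logistic regression, reusing x1, x2, y from above
X <- cbind(1, x1, x2)                 # design matrix with an intercept column
beta <- rep(0, ncol(X))               # starting values beta^old = (0, 0, 0)
for (iter in 1:50) {
  P <- as.vector(plogis(X %*% beta))  # fitted probabilities P_i
  grad <- t(X) %*% (y - P)            # gradient: X'(y - P)
  W <- P * (1 - P)                    # weights P_i (1 - P_i)
  hess <- -t(X) %*% (X * W)           # Hessian: -X'WX
  beta <- beta - solve(hess, grad)    # beta^new = beta^old - H^{-1} gradient
  if (max(abs(grad)) < 1e-8) break    # stop when the gradient is numerically zero
}
drop(beta)                            # compare with coef(model)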