Feng Li

School of Statistics and Mathematics

Central University of Finance and Economics

Suppose we want to find a minimum or maximum of a function $f(x)$.

First order condition: Find the derivative $f'(x)$ and find $x^*$ such that $f'(x^*)=0$

This is the same as finding a root of the first derivative, so we can use the Newton-Raphson algorithm on the first derivative.

1. Select an initial value $x_0$ and set $n=0$.

2. Set $x_{n+1}=x_{n}-\frac{f'(x_n)}{f''(x_n)}$.

3. Evaluate $|f'(x_{n+1})|$.

4. If $|f'(x_{n+1})|<\epsilon$ then stop. Otherwise set $n=n+1$ and go back to step 2.

Common stopping criteria:

- $| f'(x_{n}) | \leq \epsilon$

- $| x_{n}-x_{n-1} | \leq \epsilon$

- $| f(x_{n})-f(x_{n-1}) | \leq \epsilon$
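The algorithm above can be sketched in Python (a minimal sketch; the name `newton_optimize` and the example objective $f(x)=x^2+e^{-x}$ are illustrative, not from the notes):

```python
import math

def newton_optimize(fp, fpp, x0, eps=1e-8, max_iter=100):
    """Newton-Raphson on the first derivative: find x with f'(x) approximately 0.

    fp, fpp: first and second derivatives of the objective f.
    Stops when |f'(x_{n+1})| < eps.
    """
    x = x0
    for _ in range(max_iter):
        x = x - fp(x) / fpp(x)       # step 2: x_{n+1} = x_n - f'(x_n)/f''(x_n)
        if abs(fp(x)) < eps:         # steps 3-4: check |f'(x_{n+1})| and stop
            return x
    raise RuntimeError("no convergence after max_iter iterations")

# Example: f(x) = x**2 + exp(-x) is convex, so the root of f' is a minimum.
x_star = newton_optimize(lambda x: 2 * x - math.exp(-x),
                         lambda x: 2 + math.exp(-x),
                         x0=1.0)
```

The minimum satisfies $2x = e^{-x}$, so the iterates settle near $x\approx 0.35$.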

Focus on the **step size** $-\frac{f'(x)}{f''(x)}$. The **signs** of the derivatives control the **direction** of the next step. The **size** of the derivatives controls the **size** of the next step.

Consider the concave function $f(x)=-x^4$, which has $f'(x)=-4x^3$ and $f''(x)=-12x^2$. There is a maximum at $x^{*}=0$.

If $f''(x)$ is negative the function is locally **concave**, and the search is for a local **maximum**. To the left of this maximum $f'(x)>0$.

Therefore $-\frac{f'(x)}{f''(x)}>0$ and the next step is to the right.

The reverse holds if $f'(x)<0$.

Large absolute values of $f'(x)$ imply a steep slope, so a big step is needed to get close to the optimum. The reverse holds for small absolute values of $f'(x)$.

If $f''(x)$ is positive the function is locally **convex**, and the search is for a local **minimum**. To the left of this minimum $f'(x)<0$.

Therefore $-\frac{f'(x)}{f''(x)}>0$ and the next step is to the right.

The reverse holds if $f'(x)>0$.

Large absolute values of $f'(x)$ imply a steep slope, so a big step is needed to get close to the optimum. The reverse holds for small absolute values of $f'(x)$.

Together with the sign of the first derivative, the sign of the second derivative controls the direction of the next step.

A larger second derivative (in absolute value) implies more curvature.

In this case smaller steps are needed to stop the algorithm from overshooting.

The opposite holds for a small second derivative.
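A numerical illustration of these step sizes (a sketch; `newton_step` is an illustrative name): for $f(x)=-x^4$ the Newton step simplifies to $x_{n+1}=x_n-\frac{-4x_n^3}{-12x_n^2}=\frac{2}{3}x_n$, so steps are large far from the maximum and shrink as the iterates approach $x^{*}=0$.

```python
def newton_step(x):
    # One Newton step for f(x) = -x**4: f'(x) = -4x**3, f''(x) = -12x**2.
    return x - (-4 * x**3) / (-12 * x**2)  # simplifies to (2/3) * x

iterates = [3.0]
for _ in range(5):
    iterates.append(newton_step(iterates[-1]))

# Step sizes |x_{n+1} - x_n| = |x_n|/3 shrink as we approach the maximum at 0.
step_sizes = [abs(b - a) for a, b in zip(iterates, iterates[1:])]
```

Note that because $f''(x^*)=0$ here, convergence is only linear (each step shrinks the error by the factor $2/3$) rather than the usual quadratic rate of Newton's method.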

Functions with more than one input

Most interesting optimization problems involve **multiple** inputs.

In determining the most risk-efficient portfolio, the return is a function of many weights (one for each asset).

In least squares estimation for a linear regression model, the sum of squares is a function of many coefficients (one for each regressor).

How do we optimize for functions $f({\mathbf{ x}})$ where ${\mathbf{ x}}$ is a vector?

- First derivative. Simply take the **partial derivatives** and put them in a vector $$\frac{\partial y}{\partial{\mathbf x}}= \left( \begin{array}{c} \frac{\partial y}{\partial x_1}\\ \frac{\partial y}{\partial x_2}\\ \vdots\\ \frac{\partial y}{\partial x_d} \end{array} \right)$$ This is called the **gradient** vector.

- An example. The function $$y=x_1^2-x_1x_2+x_2^2+e^{x_2}$$

Has gradient vector $$\frac{\partial y}{\partial{\mathbf x}}= \left( \begin{array}{c} 2x_1-x_2\\ -x_1+2x_2+e^{x_2} \end{array} \right)$$

- Second derivative. Simply take the second order **partial derivatives**. This will give a matrix $$\frac{\partial^2 y}{\partial{\mathbf x}\partial{\mathbf x}'}= \left( \begin{array}{cccc} \frac{\partial^2 y}{\partial x_1^2}&\frac{\partial^2 y}{\partial x_1\partial x_2}&\cdots&\frac{\partial^2 y}{\partial x_1\partial x_d}\\ \frac{\partial^2 y}{\partial x_2\partial x_1}&\frac{\partial^2 y}{\partial x_2^2}&\cdots&\frac{\partial^2 y}{\partial x_2\partial x_d}\\ \vdots&\vdots&\ddots&\vdots\\ \frac{\partial^2 y}{\partial x_d\partial x_1}&\frac{\partial^2 y}{\partial x_d\partial x_2}&\cdots&\frac{\partial^2 y}{\partial x_d^2} \end{array} \right)$$ This is called the **Hessian** matrix.

- An example. The function $$y=x_1^2-x_1x_2+x_2^2+e^{x_2}$$ has Hessian matrix $$\frac{\partial^2 y}{\partial{\mathbf x}\partial{\mathbf x}'}= \left( \begin{array}{cc} 2 & -1\\ -1 & 2 + e^{x_2} \end{array} \right)$$
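The gradient and Hessian of this example can be checked numerically with central differences (a sketch; the helper names `f`, `grad` and `hess` are illustrative):

```python
import math

def f(x1, x2):
    return x1**2 - x1 * x2 + x2**2 + math.exp(x2)

def grad(x1, x2):
    # Analytic gradient from the example above.
    return [2 * x1 - x2, -x1 + 2 * x2 + math.exp(x2)]

def hess(x1, x2):
    # Analytic Hessian from the example above.
    return [[2.0, -1.0], [-1.0, 2.0 + math.exp(x2)]]

# Central-difference approximations at an arbitrary point.
x1, x2 = 0.7, -0.3
h = 1e-6
g_num = [(f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h),
         (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)]
h2 = 1e-4  # a larger step keeps the second difference numerically stable
h22_num = (f(x1, x2 + h2) - 2 * f(x1, x2) + f(x1, x2 - h2)) / h2**2
```

The numerical values agree with the analytic formulas up to finite-difference error.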

- The derivative of a vector $\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}$ by a scalar $x$ is written (in numerator layout notation) as $$\frac{\partial \mathbf{y}}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x}\\ \frac{\partial y_2}{\partial x}\\ \vdots\\ \frac{\partial y_m}{\partial x} \end{bmatrix}.$$ In vector calculus this derivative of a vector $\mathbf{y}$ with respect to a scalar $x$ is known as the tangent vector of $\mathbf{y}$.

**The derivative of a scalar $y$ by a vector** $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$ is written (in numerator layout notation) as $$\frac{\partial y}{\partial \mathbf{x}} = \left[ \frac{\partial y}{\partial x_1} \ \ \frac{\partial y}{\partial x_2} \ \ \cdots \ \ \frac{\partial y}{\partial x_n} \right].$$

**The second order derivatives of a scalar $y$ by a vector** $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$ are written (in numerator layout notation) as $$\begin{aligned} \frac{\partial^2 y}{\partial \mathbf{x}\partial \mathbf{x}'}&=\frac{\partial }{\partial \mathbf{x}'}\left[\frac{\partial y}{\partial \mathbf{x}}\right]=\frac{\partial}{\partial \mathbf{x}'}\left[ \frac{\partial y}{\partial x_1} \ \ \frac{\partial y}{\partial x_2} \ \ \cdots \ \ \frac{\partial y}{\partial x_n} \right] \\ &= \begin{bmatrix} \frac{\partial^2 y}{\partial x_1^2} & \frac{\partial^2 y}{\partial x_1\partial x_2} & \cdots & \frac{\partial^2 y}{\partial x_1\partial x_n}\\ \frac{\partial^2 y}{\partial x_2\partial x_1} & \frac{\partial^2 y}{\partial x_2^2} & \cdots & \frac{\partial^2 y}{\partial x_2\partial x_n}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial^2 y}{\partial x_n\partial x_1} & \frac{\partial^2 y}{\partial x_n\partial x_2} & \cdots & \frac{\partial^2 y}{\partial x_n^2} \end{bmatrix}. \end{aligned}$$

The derivative of a vector function (a vector whose components are functions) $\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}$, with respect to an input vector $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$, is written (in numerator layout notation) as

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n}\\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}.$$

- The derivative of a matrix function $\mathbf{Y}$ by a scalar $x$ is known as the tangent matrix and is given (in numerator layout notation) by $$\frac{\partial \mathbf{Y}}{\partial x} = \begin{bmatrix} \frac{\partial y_{11}}{\partial x} & \frac{\partial y_{12}}{\partial x} & \cdots & \frac{\partial y_{1n}}{\partial x}\\ \frac{\partial y_{21}}{\partial x} & \frac{\partial y_{22}}{\partial x} & \cdots & \frac{\partial y_{2n}}{\partial x}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial y_{m1}}{\partial x} & \frac{\partial y_{m2}}{\partial x} & \cdots & \frac{\partial y_{mn}}{\partial x} \end{bmatrix}.$$

The derivative of a scalar function $y$ of a matrix $\mathbf{X}$ of independent variables, with respect to the matrix $\mathbf{X}$, is given (in numerator layout notation) by

$$\frac{\partial y}{\partial \mathbf{X}} = \begin{bmatrix} \frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{21}} & \cdots & \frac{\partial y}{\partial x_{p1}}\\ \frac{\partial y}{\partial x_{12}} & \frac{\partial y}{\partial x_{22}} & \cdots & \frac{\partial y}{\partial x_{p2}}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial y}{\partial x_{1q}} & \frac{\partial y}{\partial x_{2q}} & \cdots & \frac{\partial y}{\partial x_{pq}} \end{bmatrix}.$$

Many rules and tricks for matrix derivatives are available in standard references.

We can now generalise the update step in Newton’s method: $$\mathbf{x}_{n+1}=\mathbf{x}_n-\left(\frac{\partial^2 f({\mathbf x})}{\partial {\mathbf x}\partial{\mathbf x}'}\right)^{-1} \left.\frac{\partial f({\mathbf x})}{\partial {\mathbf x}}\right|_{{\mathbf x}={\mathbf x}_n}$$

Write code to minimise $y=x_1^2-x_1x_2+x_2^2+e^{x_2}$
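One possible solution sketch in pure Python (assumptions: the helper names are illustrative, and the $2\times 2$ Hessian is inverted by hand rather than with a linear-algebra library):

```python
import math

def grad(x1, x2):
    # Gradient of y = x1^2 - x1*x2 + x2^2 + exp(x2).
    return (2 * x1 - x2, -x1 + 2 * x2 + math.exp(x2))

def hess(x1, x2):
    # Hessian of the same function.
    return ((2.0, -1.0), (-1.0, 2.0 + math.exp(x2)))

def newton_2d(x1, x2, eps=1e-10, max_iter=50):
    """Newton's method: x_{n+1} = x_n - H(x_n)^{-1} g(x_n)."""
    for _ in range(max_iter):
        g, H = grad(x1, x2), hess(x1, x2)
        if max(abs(g[0]), abs(g[1])) < eps:
            break
        det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
        # Explicit 2x2 inverse applied to the gradient.
        step1 = (H[1][1] * g[0] - H[0][1] * g[1]) / det
        step2 = (-H[1][0] * g[0] + H[0][0] * g[1]) / det
        x1, x2 = x1 - step1, x2 - step2
    return x1, x2

x_min = newton_2d(0.0, 0.0)
```

Since the Hessian is positive definite everywhere, the iterates converge to the unique minimum, where $2x_1=x_2$ and $-x_1+2x_2+e^{x_2}=0$.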

Consider the linear regression model with multiple covariates,

$$y_i = \beta_0 + \beta_1x_{1i}+...+\beta_p x_{pi} + \epsilon_i$$ where $\epsilon_i \sim N(0, \sigma^2)$

What is the (log) likelihood function?

What are the unknown parameters? Let’s consider three situations:

- When $\beta_0=1$ and $\sigma^2=1$ are known.

- When $\sigma^2=1$ is known.

- When neither $\beta_i$ nor $\sigma^2$ is known.

What are the gradient and Hessian matrix for the log likelihood ($\log\mathcal{L}$) with respect to the parameter vector $\boldsymbol{\beta}=(\beta_0,...,\beta_p)'$?

$$\frac{\partial \log \mathcal{L}}{\partial \boldsymbol{\beta}} = ?$$

$$\frac{\partial^2 \log \mathcal{L}}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}'} = ?$$
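For the case where only the coefficients are unknown, a standard way to write these (a sketch: stack the observations as $\mathbf{y}=(y_1,\dots,y_n)'$ and a design matrix $X$ whose first column is ones; this stacked notation is introduced here, not above) is

$$\log \mathcal{L}(\boldsymbol{\beta}) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(\mathbf{y}-X\boldsymbol{\beta})'(\mathbf{y}-X\boldsymbol{\beta})$$

$$\frac{\partial \log \mathcal{L}}{\partial \boldsymbol{\beta}} = \frac{1}{\sigma^2}X'(\mathbf{y}-X\boldsymbol{\beta}), \qquad \frac{\partial^2 \log \mathcal{L}}{\partial \boldsymbol{\beta}\,\partial \boldsymbol{\beta}'} = -\frac{1}{\sigma^2}X'X$$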

Write down the likelihood function with respect to the unknown parameters.

Write down the gradient for the likelihood function.

Write down the Hessian for the likelihood function.

Use Newton’s method to obtain the best parameter estimate.
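A sketch for the simplest multi-parameter case, one covariate with $\sigma^2=1$ (assumptions: simulated data and illustrative names; the gradient is $X'(\mathbf{y}-X\beta)$ and the Hessian $-X'X$ is constant, so one Newton step from any starting point lands exactly on the maximum likelihood estimate, which equals OLS):

```python
import random

# Simulate y_i = beta0 + beta1 * x_i + eps_i with beta0 = 1, beta1 = 2, sigma^2 = 1.
random.seed(1)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0, 1) for xi in x]

def grad(b0, b1):
    # d logL / d beta = X'(y - X beta) with sigma^2 = 1, written element-wise.
    r = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    return (sum(r), sum(ri * xi for ri, xi in zip(r, x)))

# Hessian is the constant matrix -X'X, so Newton converges in a single step.
h00, h01, h11 = n, sum(x), sum(xi * xi for xi in x)
det = h00 * h11 - h01 * h01
b0, b1 = 0.0, 0.0                        # arbitrary starting point
g = grad(b0, b1)
b0 += (h11 * g[0] - h01 * g[1]) / det    # beta_new = beta_old + (X'X)^{-1} X'(y - X beta_old)
b1 += (-h01 * g[0] + h00 * g[1]) / det
```

After the single step the gradient is (numerically) zero and the estimates sit close to the true values $(1, 2)$.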

**The idea**: use the maximum likelihood method with the binomial distribution.

Owning a house ($Y=1$) or not owning a house ($Y=0$) can be represented with the **Bernoulli distribution** $$\Pr(y;p) = p^y (1-p)^{1-y}\quad \text{for }y\in\{0,1\}.$$

The log likelihood function is $$l(\beta)= \sum \limits_{i=1}^N\left\{ y_i \log P_i + (1- y_i) \log (1-P_i) \right\}$$ where

$$P_i=\frac{1}{1+\exp(-(\beta_1+\beta_2X_{2i}+...+\beta_pX_{pi}))}$$

Note that the sum of $n$ Bernoulli samples will be **binomially** distributed.

To obtain $\hat \beta$, use the Newton-Raphson algorithm $$\beta^{new} = \beta^{old} - \left(\frac{\partial^2 l(\beta)}{\partial \beta\,\partial \beta'} \right)^{-1} \left.\frac{\partial l(\beta)}{\partial\beta}\right|_{\beta= \beta^{old}}$$

Use Newton’s method to find the maximum likelihood estimators for the coefficients in a logistic regression. The steps are:

1. Write down the likelihood function.

2. Find the gradient and Hessian matrix.

3. Code these up in R/Python.

4. Test your code using the following simulated data.


```
# Create some example data
set.seed(123)
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(1 + 2*x1 + 3*x2))
# Fit a logistic regression model
model <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
summary(model)
```
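A pure-Python sketch of the same exercise (assumptions: `solve` and `logistic_newton` are illustrative helpers; Python's `random.seed(123)` does not reproduce R's `set.seed(123)` draws, so the simulation only mirrors the design, not the exact numbers):

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small systems only)."""
    m = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(m):
        p = max(range(k, m), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, m):
            f = M[i][k] / M[k][k]
            for j in range(k, m + 1):
                M[i][j] -= f * M[k][j]
    out = [0.0] * m
    for i in range(m - 1, -1, -1):
        out[i] = (M[i][m] - sum(M[i][j] * out[j] for j in range(i + 1, m))) / M[i][i]
    return out

def logistic_newton(X, y, iters=25):
    """Newton-Raphson for logistic regression; each row of X starts with a 1 (intercept)."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        eta = [sum(b * xij for b, xij in zip(beta, xi)) for xi in X]
        mu = [1.0 / (1.0 + math.exp(-max(min(e, 30.0), -30.0))) for e in eta]  # clipped for safety
        g = [sum((yi - mi) * xi[j] for xi, yi, mi in zip(X, y, mu)) for j in range(p)]
        # Hessian of l(beta) is -X'WX with W = diag(mu_i * (1 - mu_i)).
        H = [[-sum(mi * (1 - mi) * xi[j] * xi[k] for xi, mi in zip(X, mu))
              for k in range(p)] for j in range(p)]
        step = solve(H, g)                  # H^{-1} gradient
        beta = [b - s for b, s in zip(beta, step)]
    return beta

# Simulate data mirroring the R example's design.
random.seed(123)
n = 1000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
X = [[1.0, a, b] for a, b in zip(x1, x2)]
y = [1 if random.random() < 1.0 / (1.0 + math.exp(-(1 + 2 * a + 3 * b))) else 0
     for a, b in zip(x1, x2)]
beta_hat = logistic_newton(X, y)
```

The estimates should land near the true coefficients $(1, 2, 3)$, matching what `glm` reports for the R simulation up to sampling error.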