Feng Li
School of Statistics and Mathematics
Central University of Finance and Economics
Suppose we want to find a minimum or maximum of a function $f(x)$
First-order condition: find the derivative $f'(x)$ and find $x^*$ such that $f'(x^*)=0$
This is the same as finding a root of the first derivative, so we can use the Newton-Raphson algorithm on the first derivative.
Select initial value $x_0$ and set $n=0$
Set $x_{n+1}=x_{n}-\frac{f'(x_n)}{f''(x_n)}$
Evaluate $|f'(x_{n+1})|$
If $|f'(x_{n+1})|<\epsilon$ then stop.
Otherwise set $n=n+1$ and go back to step 2.
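As an illustration, here is a minimal R sketch of this iteration; the function name newton_optimise and its arguments are our own choices, not part of any package.
# Newton-Raphson for optimising a univariate function
# fp:  first derivative f'(x),  fpp: second derivative f''(x)
newton_optimise <- function(fp, fpp, x0, eps = 1e-8, max_iter = 100) {
  x <- x0
  for (n in seq_len(max_iter)) {
    x <- x - fp(x) / fpp(x)        # step 2: Newton update
    if (abs(fp(x)) < eps) {        # steps 3-4: stop when |f'(x)| is small
      return(list(x = x, iterations = n))
    }
  }
  warning("maximum number of iterations reached")
  list(x = x, iterations = max_iter)
}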
Focus on the step size $-\frac{f'(x)}{f''(x)}$.
The signs of the derivatives control the direction of the next step.
The size of the derivatives controls the size of the next step.
Consider the concave function $f(x)=-x^4$ which has $f'(x)=-4x^3$ and $f''(x)=-12x^2$. There is a maximum at $x^{*}=0$
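For this example the Newton step simplifies nicely:
$$x_{n+1}=x_n-\frac{f'(x_n)}{f''(x_n)}=x_n-\frac{-4x_n^3}{-12x_n^2}=x_n-\frac{x_n}{3}=\frac{2}{3}\,x_n,$$
so every iteration moves one third of the way from $x_n$ towards the maximum at $x^*=0$.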
If $f''(x)$ is negative the function is locally concave, and the search is for a local maximum
To the left of this maximum $f'(x)>0$
Therefore $-\frac{f'(x)}{f''(x)}>0$.
The next step is to the right.
The reverse holds if $f'(x)<0$
Large absolute values of $f'(x)$ imply a steep slope, so a big step is needed to get close to the optimum. The reverse holds for small absolute values of $f'(x)$.
If $f''(x)$ is positive the function is locally convex, and the search is for a local minimum
To the left of this minimum $f'(x)<0$
Therefore $-\frac{f'(x)}{f''(x)}>0$.
The next step is to the right.
The reverse holds if $f'(x)>0$
Large absolute values of $f'(x)$ imply a steep slope, so a big step is needed to get close to the optimum. The reverse holds for small absolute values of $f'(x)$.
Together with the sign of the first derivative, the sign of the second derivative controls the direction of the next step.
A larger second derivative (in absolute value) implies more curvature
In this case smaller steps are needed to stop the algorithm from overshooting.
The opposite holds for a small second derivative.
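Using the newton_optimise sketch above (our own helper, not a library routine), the behaviour on $f(x)=-x^4$ can be checked directly; each iterate is two thirds of the previous one, so the sequence shrinks steadily towards $x^*=0$.
# Maximise f(x) = -x^4, using f'(x) = -4x^3 and f''(x) = -12x^2
newton_optimise(fp  = function(x) -4 * x^3,
                fpp = function(x) -12 * x^2,
                x0  = 1)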
Functions with more than one input
Most interesting optimization problems involve multiple inputs.
In determining the most risk efficient portfolio the return is a function of many weights (one for each asset).
In least squares estimation for a linear regression model, the sum of squares is a function of many coefficients (one for each regressor).
How do we optimize for functions $f({\mathbf{ x}})$ where ${\mathbf{ x}}$ is a vector?
For a function $y=f({\mathbf x})$ of a vector ${\mathbf x}=(x_1,\ldots,x_d)'$, collect the first-order partial derivatives into $$\frac{\partial y}{\partial{\mathbf x}}= \left( \begin{array}{c} \frac{\partial y}{\partial x_1}\\ \frac{\partial y}{\partial x_2}\\ \vdots\\ \frac{\partial y}{\partial x_d} \end{array} \right)$$ This is called the gradient vector.
The function $y=x_1^2-x_1x_2+x_2^2+e^{x_2}$ has gradient vector $$\frac{\partial y}{\partial{\mathbf x}}= \left( \begin{array}{c} 2x_1-x_2\\ -x_1+2x_2+e^{x_2} \end{array} \right)$$
Similarly, collect the second-order partial derivatives into $$\frac{\partial^2 y}{\partial {\mathbf x}\partial{\mathbf x}'}= \left( \begin{array}{cccc} \frac{\partial^2 y}{\partial x_1^2}&\frac{\partial^2 y}{\partial x_1\partial x_2}&\cdots&\frac{\partial^2 y}{\partial x_1\partial x_d}\\ \frac{\partial^2 y}{\partial x_2\partial x_1}&\frac{\partial^2 y}{\partial x_2^2}&\cdots&\frac{\partial^2 y}{\partial x_2\partial x_d}\\ \vdots&\vdots&\ddots&\vdots\\ \frac{\partial^2 y}{\partial x_d\partial x_1}&\frac{\partial^2 y}{\partial x_d\partial x_2}&\cdots&\frac{\partial^2 y}{\partial x_d^2}\\ \end{array} \right)$$ This is called the Hessian matrix.
The function $y=x_1^2-x_1x_2+x_2^2+e^{x_2}$ has Hessian matrix $$\frac{\partial^2 y}{\partial {\mathbf x}\partial{\mathbf x}'}= \left( \begin{array}{cc} 2 & -1\\ -1 & 2 + e^{x_2} \end{array} \right)$$
In vector calculus the derivative of a vector $\mathbf{y}$ with respect to a scalar $x$ is known as the tangent vector of the vector $\mathbf{y}$ and is written (in numerator layout notation) as $$\begin{aligned} \frac{\partial \mathbf{y}}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x}\\ \frac{\partial y_2}{\partial x}\\ \vdots\\ \frac{\partial y_m}{\partial x}\\ \end{bmatrix}. \end{aligned}$$
The derivative of a scalar $y$ by a vector $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \\ \end{bmatrix}$ is written (in numerator layout notation) as $$\begin{aligned} \frac{\partial y}{\partial \mathbf{x}} = \left[ \frac{\partial y}{\partial x_1} \ \ \frac{\partial y}{\partial x_2} \ \ \cdots \ \ \frac{\partial y}{\partial x_n} \right]. \end{aligned}$$
The second-order derivative of a scalar $y$ by a vector $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \\ \end{bmatrix}$ is written (in numerator layout notation) as
$$\begin{aligned}
\frac{\partial^2 y}{\partial \mathbf{x}\partial \mathbf{x}'}&=\frac{\partial}{\partial \mathbf{x}'}\left[\frac{\partial y}{\partial \mathbf{x}}\right]=\frac{\partial}{\partial \mathbf{x}'}\left[ \frac{\partial y}{\partial x_1} \ \ \frac{\partial y}{\partial x_2} \ \ \cdots \ \ \frac{\partial y}{\partial x_n} \right] \\&= \begin{bmatrix} \frac{\partial^2 y}{\partial x_1^2} & \frac{\partial^2 y}{\partial x_1\partial x_2} & \cdots & \frac{\partial^2 y}{\partial x_1\partial x_n}\\ \frac{\partial^2 y}{\partial x_2\partial x_1} & \frac{\partial^2 y}{\partial x_2^2} & \cdots & \frac{\partial^2 y}{\partial x_2\partial x_n}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial^2 y}{\partial x_n\partial x_1} & \frac{\partial^2 y}{\partial x_n\partial x_2} & \cdots & \frac{\partial^2 y}{\partial x_n^2}\\ \end{bmatrix}.
\end{aligned}$$
The derivative of a vector function (a vector whose components are functions) $\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \\ \end{bmatrix}$ , with respect to an input vector, $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \\ \end{bmatrix}$ , is written (in numerator layout notation) as
$$\begin{aligned}
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n}\\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n}\\ \end{bmatrix}.
\end{aligned}$$
The derivative of a matrix function $\mathbf{Y}$ by a scalar $x$ is known as the tangent matrix and is given (in numerator layout notation) by
$$\begin{aligned}
\frac{\partial \mathbf{Y}}{\partial x} = \begin{bmatrix} \frac{\partial y_{11}}{\partial x} & \frac{\partial y_{12}}{\partial x} & \cdots & \frac{\partial y_{1n}}{\partial x}\\ \frac{\partial y_{21}}{\partial x} & \frac{\partial y_{22}}{\partial x} & \cdots & \frac{\partial y_{2n}}{\partial x}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial y_{m1}}{\partial x} & \frac{\partial y_{m2}}{\partial x} & \cdots & \frac{\partial y_{mn}}{\partial x}\\ \end{bmatrix}.
\end{aligned}$$
The derivative of a scalar $y$, viewed as a function of a matrix $\mathbf{X}$ of independent variables, with respect to the matrix $\mathbf{X}$ is given (in numerator layout notation) by
$$\begin{aligned}
\frac{\partial y}{\partial \mathbf{X}} = \begin{bmatrix} \frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{21}} & \cdots & \frac{\partial y}{\partial x_{p1}}\\ \frac{\partial y}{\partial x_{12}} & \frac{\partial y}{\partial x_{22}} & \cdots & \frac{\partial y}{\partial x_{p2}}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial y}{\partial x_{1q}} & \frac{\partial y}{\partial x_{2q}} & \cdots & \frac{\partial y}{\partial x_{pq}}\\ \end{bmatrix}.
\end{aligned}$$
Many rules and tricks for matrix derivatives are available from standard references on matrix calculus.
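Hand-derived gradients and Hessians can also be checked numerically. A small sketch using the numDeriv R package (assuming it is installed), applied to the example function above:
# Numerical check of the gradient and Hessian of y = x1^2 - x1*x2 + x2^2 + exp(x2)
library(numDeriv)
f <- function(x) x[1]^2 - x[1] * x[2] + x[2]^2 + exp(x[2])
x0 <- c(1, 1)
grad(f, x0)     # compare with c(2*1 - 1, -1 + 2*1 + exp(1))
hessian(f, x0)  # compare with matrix(c(2, -1, -1, 2 + exp(1)), 2, 2)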
We can now generalise the update step in Newton’s method: $${\mathbf x}_{n+1}={\mathbf x}_n-\left(\frac{\partial^2 f({\mathbf x}_n)}{\partial {\mathbf x}\partial{\mathbf x}'}\right)^{-1} \frac{\partial f({\mathbf x}_n)}{\partial {\mathbf x}}$$
Write code to minimise $y=x_1^2-x_1x_2+x_2^2+e^{x_2}$
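One possible solution sketch in R, reusing the gradient and Hessian derived above; the helper names grad_f and hess_f, the starting value, and the stopping rule are our own choices:
# Multivariate Newton's method for y = x1^2 - x1*x2 + x2^2 + exp(x2)
grad_f <- function(x) c(2 * x[1] - x[2],
                        -x[1] + 2 * x[2] + exp(x[2]))
hess_f <- function(x) matrix(c(2, -1,
                               -1, 2 + exp(x[2])), nrow = 2, byrow = TRUE)
x <- c(1, 1)                                # starting value
for (i in 1:100) {
  x <- x - drop(solve(hess_f(x), grad_f(x)))  # Newton step: Hessian^{-1} %*% gradient
  if (sqrt(sum(grad_f(x)^2)) < 1e-8) break    # stop when the gradient is near zero
}
x  # approximate minimiser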
Consider the linear regression model with multiple covariates,
$$y_i = \beta_0 + \beta_1x_{1i}+...+\beta_p x_{pi} + \epsilon_i$$ where $\epsilon_i \sim N(0, \sigma^2)$
What is the (log) likelihood function?
What are the unknown parameters? Let’s consider three situations
When $\beta_0=1$ and $\sigma^2=1$ are known.
When $\sigma^2=1$ is known.
Neither $\beta_i$ nor $\sigma$ is known.
What are the gradient and the Hessian matrix for the log likelihood ($\mathcal{L}$) with respect to the parameter vector $\mathbf{\beta}=(\beta_0,...,\beta_p)'$?
$$\frac{\partial \log \mathcal{L}}{\partial \mathbf{\beta}} = ?$$
$$\frac{\partial^2 \log \mathcal{L}}{\partial \mathbf{\beta} \partial \mathbf{\beta}'} = ?$$
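As a reference for checking your derivation (one standard matrix form, with the notation chosen here): stacking the observations as $\mathbf{y}=\mathbf{X}\mathbf{\beta}+\mathbf{\epsilon}$, where $\mathbf{X}$ has a leading column of ones, and treating $\sigma^2$ as known,
$$\frac{\partial \log \mathcal{L}}{\partial \mathbf{\beta}}=\frac{1}{\sigma^2}\mathbf{X}'(\mathbf{y}-\mathbf{X}\mathbf{\beta}), \qquad \frac{\partial^2 \log \mathcal{L}}{\partial \mathbf{\beta}\partial \mathbf{\beta}'}=-\frac{1}{\sigma^2}\mathbf{X}'\mathbf{X}.$$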
Write down the likelihood function with respect to the unknown parameters.
Write down the gradient for the likelihood function.
Write down the Hessian for the likelihood function.
Use Newton’s method to obtain the best parameter estimate.
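A minimal sketch in R for the simplest of the three situations ($\sigma^2=1$ known), using the gradient and Hessian given above; the simulated data, variable names, and tolerance are our own illustration, and the result can be checked against lm():
# Newton's method for the regression coefficients, assuming sigma^2 = 1 is known
set.seed(42)
n <- 200
X <- cbind(1, rnorm(n), rnorm(n))      # design matrix with intercept column
beta_true <- c(1, 2, -1)
y <- X %*% beta_true + rnorm(n)
beta <- rep(0, ncol(X))                # starting value
for (i in 1:50) {
  grad <- t(X) %*% (y - X %*% beta)    # gradient of the log likelihood (sigma^2 = 1)
  hess <- -t(X) %*% X                  # Hessian of the log likelihood
  beta <- beta - drop(solve(hess, grad))
  if (sqrt(sum(grad^2)) < 1e-8) break
}
cbind(newton = beta, ols = coef(lm(y ~ X - 1)))  # the two columns should agree
Because the log likelihood is quadratic in $\beta$ here, a single Newton step already reaches the least-squares solution.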
The idea: use the maximum likelihood method with the Bernoulli/binomial distribution.
Owning a house ($Y=1$) or not owning a house ($Y=0$) can be represented with a Bernoulli distribution $$Pr(y;p) = p^y (1-p)^{1-y}\!\quad \text{for }y\in\{0,1\}.$$
The log likelihood function is $$l(\beta)= \sum \limits_{i=1}^N\left\{ y_i \log P_i + (1- y_i) \log (1-P_i) \right\}$$ where
$$P_i=\frac{1}{1+\exp(-(\beta_1+\beta_2X_{2i}+...+\beta_pX_{pi}))}$$
Note that the sum of $N$ independent Bernoulli samples is binomially distributed.
To obtain $\hat \beta$, use Newton-Raphson algorithm $$\beta^{new} = \beta^{old} - \left(\frac{\partial^2l(\beta)}{\partial
\beta\partial \beta'} \right)^{-1} \frac{\partial
l(\beta)}{\partial\beta} | _{\beta= \beta^{old}}$$
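For reference when deriving the two ingredients of this update (standard results, with $\mathbf{X}$ denoting the matrix whose $i$-th row is $(1, X_{2i},\ldots,X_{pi})$, $\mathbf{y}$ the response vector and $\mathbf{P}$ the vector of probabilities $P_i$):
$$\frac{\partial l(\beta)}{\partial \beta}=\mathbf{X}'(\mathbf{y}-\mathbf{P}), \qquad \frac{\partial^2 l(\beta)}{\partial \beta\partial \beta'}=-\mathbf{X}'\mathbf{W}\mathbf{X}, \quad \text{where } \mathbf{W}=\operatorname{diag}\{P_i(1-P_i)\}.$$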
Use Newton’s method to find the maximum likelihood estimators for the coefficients in a logistic regression. The steps are:
Write down the likelihood function
Find the gradient and Hessian matrix
Code these up in R/Python
Use the following simulated data to test your code
# Create some example data
set.seed(123)
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(1 + 2*x1 + 3*x2))
# Fit a logistic regression model
model <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
summary(model)
Call:
glm(formula = y ~ x1 + x2, family = binomial(link = "logit"))

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-3.00404  -0.25042   0.06014   0.34895   2.97130

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.1877     0.1291   9.199   <2e-16 ***
x1            2.1243     0.1768  12.018   <2e-16 ***
x2            3.4635     0.2395  14.460   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1329.1  on 999  degrees of freedom
Residual deviance:  515.4  on 997  degrees of freedom
AIC: 521.4

Number of Fisher Scoring iterations: 7
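A minimal hand-coded Newton-Raphson sketch for comparison, run after the simulation code above; the variable names X, P, W and the tolerance are our own choices:
# Newton-Raphson for logistic regression on the simulated data above
X <- cbind(1, x1, x2)                  # design matrix with intercept
beta <- rep(0, ncol(X))                # starting value
for (i in 1:100) {
  P <- plogis(X %*% beta)              # fitted probabilities P_i
  grad <- t(X) %*% (y - P)             # gradient of the log likelihood
  W <- as.vector(P * (1 - P))
  hess <- -t(X) %*% (W * X)            # Hessian of the log likelihood
  beta <- beta - drop(solve(hess, grad))
  if (sqrt(sum(grad^2)) < 1e-8) break
}
beta  # compare with coef(model)
The resulting coefficients should agree with coef(model), since for the logistic (canonical) link glm's Fisher scoring coincides with Newton-Raphson.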