{ "cells": [ { "cell_type": "markdown", "id": "c57b2ce5", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Newton's Algorithm\n", "\n", "\n", "Feng Li\n", "\n", "School of Statistics and Mathematics\n", "\n", "Central University of Finance and Economics\n", "\n", "[feng.li@cufe.edu.cn](mailto:feng.li@cufe.edu.cn)\n", "\n", "[https://feng.li/statcomp](https://feng.li/statcomp)\n" ] }, { "cell_type": "markdown", "id": "57470e28", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Finding a maximum/minimum\n", "\n", "- Suppose we want to find an minimum or maximum of a function $f(x)$\n", "\n", "- First order condition: Find the derivative $f'(x)$ and find $x^*$\n", " such that $f'(x^*)=0$\n", "\n", "- This is the same as finding a root of the first derivative. We can\n", " use the Newton Raphson algorithm on the first derivative." ] }, { "cell_type": "markdown", "id": "abe3d30a", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Newton’s algorithm for finding local minima/maxima\n", "\n", "1. Select initial value $x_0$ and set $n=0$\n", "\n", "2. Set $x_{n+1}=x_{n}-\\frac{f'(x_n)}{f''(x_n)}$\n", "\n", "3. Evaluate $|f'(x_{n+1})|$\n", "\n", " - If $|f'(x_{n+1})|<\\epsilon$ then stop.\n", "\n", " - Otherwise set $n=n+1$ and go back to step 2.\n" ] }, { "cell_type": "markdown", "id": "53705a1f", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Different Stopping Rules Three stopping rules can be used\n", "\n", "- $| f'(x_{n}) | \\leq \\epsilon$\n", "\n", "\n", "- $| x_{n}-x_{n-1} | \\leq \\epsilon$\n", "\n", "\n", "- $| f(x_{n})-f(x_{n-1}) | \\leq \\epsilon$" ] }, { "cell_type": "markdown", "id": "2e355bd0", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Intuition\n", "\n", "- Focus the **step size** $-\\frac{f'(x)}{f''(x)}$.\n", "\n", "- The **signs** of the derivatives control the **direction** of the\n", " next step.\n", "\n", "- The **size** of the derivatives control the **size** of the next\n", " step.\n", "\n", "- Consider the concave function $f(x)=-x^4$ which has $f'(x)=-4x^3$\n", " and $f''(x)=-12x^2$. There is a maximum at $x^{*}=0$" ] }, { "cell_type": "markdown", "id": "e92968c0", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Role of first derivative\n", "\n", "- If $f''(x)$ is negative the function is locally **concave**, and the\n", " search is for a local **maximum**\n", "\n", "- To the left of this maximum $f'(x)>0$\n", "\n", "- Therefore $-\\frac{f'(x)}{f''(x)}>0$.\n", "\n", "- The next step is to the right.\n", "\n", "- The reverse holds if $f'(x)<0$\n", "\n", "- Large absolute values of $f'(x)$ imply a steep slope. A big step is\n", " needed to get close to the optimum. The reverse hold for small\n", " absolute value of $f'(x)$.\n" ] }, { "cell_type": "markdown", "id": "b0372336", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- If $f''(x)$ is positive the function is locally **convex**, and the\n", " search is for a local **minimum**\n", "\n", "- To the left of this maximum $f'(x)<0$\n", "\n", "- Therefore $-\\frac{f'(x)}{f''(x)}>0$.\n", "\n", "- The next step is to the right.\n", "\n", "- The reverse holds if $f'(x)>0$\n", "\n", "- Large absolute values of $f'(x)$ imply a steep slope. A big step is\n", " needed to get close to the optimum. The reverse hold for small\n", " absolute value of $f'(x)$." 
] }, { "cell_type": "markdown", "id": "a169a62d", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Role of second derivative\n", "\n", "- Together with the sign of the first derivative, the sign of the\n", " second derivative controls the direction of the next step.\n", "\n", "- A larger second derivative (in absolute value) implies a more\n", " curvature\n", "\n", "- In this case smaller steps are need to stop the algorithm from\n", " overshooting.\n", "\n", "- The opposite holds for a small second derivative.\n" ] }, { "cell_type": "markdown", "id": "68ec2be0", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Multidimensional Optimization\n", "\n", "Functions with more than one input\n", "\n", "- Most interesting optimization problems involve **multiple** inputs.\n", "\n", " - In determining the most risk efficient portfolio the return is a\n", " function of many weights (one for each asset).\n", "\n", " - In least squares estimation for a linear regression model, the\n", " sum of squares is a function of many coefficients (one for each\n", " regressor).\n", "\n", "- How do we optimize for functions $f({\\mathbf{ x}})$ where ${\\mathbf{ x}}$ is a\n", " vector?\n", "\n" ] }, { "cell_type": "markdown", "id": "e1a705aa", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- First derivative. Simply take the **partial derivatives** and put them in\n", "a vector $$\\frac{\\partial y}{\\partial{\\mathbf x}}=\n", " \\left(\n", " \\begin{array}{c}\n", " \\frac{\\partial y}{\\partial x_1}\\\\\n", " \\frac{\\partial y}{\\partial x_2}\\\\\n", " \\vdots\\\\\n", " \\frac{\\partial y}{\\partial x_d}\n", " \\end{array}\n", " \\right)$$ This is called the **gradient** vector.\n" ] }, { "cell_type": "markdown", "id": "ef18dbe2", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- An example. The function $$y=x_1^2-x_1x_2+x_2^2+e^{x_2}$$\n", "\n", "Has gradient vector $$\\frac{\\partial y}{\\partial{\\mathbf x}}=\n", " \\left(\n", " \\begin{array}{c}\n", " 2x_1-x_2\\\\\n", " -x_1+2x_2+e^{x_2}\n", " \\end{array}\n", " \\right)$$" ] }, { "cell_type": "markdown", "id": "5780f0b4", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Second derivative. Simply take the second order **partial derivatives**. This will give a matrix\n", "$$\\frac{\\partial y}{\\partial{\\mathbf x}\\partial{\\mathbf x}'}=\n", " \\left(\n", " \\begin{array}{cccc}\n", " \\frac{\\partial^2 y}{\\partial x_1^2}&\\frac{\\partial^2 y}{\\partial x_1\\partial x_2}&\\cdots&\\frac{\\partial^2 y}{\\partial x_1\\partial x_d}\\\\\n", " \\frac{\\partial^2 y}{\\partial x_2\\partial x_1}&\\frac{\\partial^2 y}{\\partial x_2^2}&\\cdots&\\frac{\\partial^2 y}{\\partial x_2\\partial x_d}\\\\\n", " \\vdots&\\vdots&\\ddots&\\vdots\\\\\n", " \\frac{\\partial^2 y}{\\partial x_d\\partial x_1}&\\frac{\\partial^2 y}{\\partial x_d\\partial x_2}&\\cdots&\\frac{\\partial^2 y}{\\partial x_d^2}\\\\\n", " \\end{array}\n", " \\right)$$ This is called the **Hessian** matrix.\n" ] }, { "cell_type": "markdown", "id": "b22c95d6", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- An example. 
The function $$y=x_1^2-x_1x_2+x_2^2+e^{x_2}$$ has Hessian matrix $$\\frac{\\partial^2 y}{\\partial{\\mathbf x}\\partial{\\mathbf x}'}=\n", " \\left(\n", " \\begin{array}{cc}\n", " 2 & -1\\\\\n", " -1 & 2 + e^{x_2}\n", " \\end{array}\n", " \\right)$$" ] },
{ "cell_type": "markdown", "id": "fe6189f8", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Preliminaries for matrix derivatives\n", "\n", "1. The derivative of a vector\n", "   $\\mathbf{y} = \\begin{bmatrix} y_1 \\\\ y_2 \\\\ \\vdots \\\\ y_m \\\\ \\end{bmatrix}$\n", "   by a scalar $x$ is written (in numerator layout notation) as\n", "   \\begin{aligned}\n", "   \\frac{\\partial \\mathbf{y}}{\\partial x} = \\begin{bmatrix} \\frac{\\partial y_1}{\\partial x}\\\\ \\frac{\\partial y_2}{\\partial x}\\\\ \\vdots\\\\ \\frac{\\partial y_m}{\\partial x}\\\\ \\end{bmatrix}.\n", "   \\end{aligned} In vector calculus this derivative, $\\frac{\\partial \\mathbf{y}}{\\partial x}$, is known as the tangent vector of $\\mathbf{y}$.\n" ] },
{ "cell_type": "markdown", "id": "34b95415", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "2. **The derivative of a scalar $y$ by a vector**\n", "   $\\mathbf{x} = \\begin{bmatrix} x_1 \\\\ x_2 \\\\ \\vdots \\\\ x_n \\\\ \\end{bmatrix}$\n", "   is written (in numerator layout notation) as \\begin{aligned}\n", "   \\frac{\\partial y}{\\partial \\mathbf{x}} = \\left[ \\frac{\\partial y}{\\partial x_1} \\ \\ \\frac{\\partial y}{\\partial x_2} \\ \\ \\cdots \\ \\ \\frac{\\partial y}{\\partial x_n} \\right].\n", "   \\end{aligned}" ] },
{ "cell_type": "markdown", "id": "da7a7db1", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "3. **The second order derivatives of a scalar $y$ by a vector**\n", "   $\\mathbf{x} = \\begin{bmatrix} x_1 \\\\ x_2 \\\\ \\vdots \\\\ x_n \\\\ \\end{bmatrix}$\n", "   is written (in numerator layout notation) as\n", "\n", "   \\begin{aligned}\n", "   \\frac{\\partial^2 y}{\\partial \\mathbf{x}\\partial \\mathbf{x}'}&=\\frac{\\partial\n", "   }{\\partial \\mathbf{x}'}\\left[\\frac{\\partial y}{\\partial \\mathbf{x}}\\right]=\\frac{\\partial}{\\partial \\mathbf{x}'}\\left[ \\frac{\\partial y}{\\partial x_1} \\ \\ \\frac{\\partial y}{\\partial x_2} \\ \\ \\cdots \\ \\ \\frac{\\partial y}{\\partial x_n} \\right] \\\\&= \\begin{bmatrix} \\frac{\\partial^2 y}{\\partial x_1^2} & \\frac{\\partial^2 y}{\\partial x_1\\partial x_2} & \\cdots & \\frac{\\partial^2 y}{\\partial x_1\\partial x_n}\\\\ \\frac{\\partial^2 y}{\\partial x_2\\partial x_1} & \\frac{\\partial^2 y}{\\partial x_2^2} & \\cdots & \\frac{\\partial^2 y}{\\partial x_2\\partial x_n}\\\\ \\vdots & \\vdots & \\ddots & \\vdots\\\\ \\frac{\\partial^2 y}{\\partial x_n\\partial x_1} & \\frac{\\partial^2 y}{\\partial x_n\\partial x_2} & \\cdots & \\frac{\\partial^2 y}{\\partial x_n^2}\\\\ \\end{bmatrix}.\n", "   \\end{aligned}" ] },
{ "cell_type": "markdown", "id": "9bf2b650", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "4. 
The derivative of a vector function (a vector whose components are\n", "   functions)\n", "   $\\mathbf{y} = \\begin{bmatrix} y_1 \\\\ y_2 \\\\ \\vdots \\\\ y_m \\\\ \\end{bmatrix}$\n", "   with respect to an input vector\n", "   $\\mathbf{x} = \\begin{bmatrix} x_1 \\\\ x_2 \\\\ \\vdots \\\\ x_n \\\\ \\end{bmatrix}$\n", "   is written (in numerator layout notation) as\n", "\n", "   \\begin{aligned}\n", "   \\frac{\\partial \\mathbf{y}}{\\partial \\mathbf{x}} = \\begin{bmatrix} \\frac{\\partial y_1}{\\partial x_1} & \\frac{\\partial y_1}{\\partial x_2} & \\cdots & \\frac{\\partial y_1}{\\partial x_n}\\\\ \\frac{\\partial y_2}{\\partial x_1} & \\frac{\\partial y_2}{\\partial x_2} & \\cdots & \\frac{\\partial y_2}{\\partial x_n}\\\\ \\vdots & \\vdots & \\ddots & \\vdots\\\\ \\frac{\\partial y_m}{\\partial x_1} & \\frac{\\partial y_m}{\\partial x_2} & \\cdots & \\frac{\\partial y_m}{\\partial x_n}\\\\ \\end{bmatrix}.\n", "   \\end{aligned} This matrix is known as the **Jacobian** of $\\mathbf{y}$ with respect to $\\mathbf{x}$." ] },
{ "cell_type": "markdown", "id": "abd8537a", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "5. The derivative of a matrix function $\\mathbf{Y}$ by a scalar $x$ is known as\n", "   the tangent matrix and is given (in numerator layout notation) by\n", "   \\begin{aligned}\n", "   \\frac{\\partial \\mathbf{Y}}{\\partial x} = \\begin{bmatrix} \\frac{\\partial y_{11}}{\\partial x} & \\frac{\\partial y_{12}}{\\partial x} & \\cdots & \\frac{\\partial y_{1n}}{\\partial x}\\\\ \\frac{\\partial y_{21}}{\\partial x} & \\frac{\\partial y_{22}}{\\partial x} & \\cdots & \\frac{\\partial y_{2n}}{\\partial x}\\\\ \\vdots & \\vdots & \\ddots & \\vdots\\\\ \\frac{\\partial y_{m1}}{\\partial x} & \\frac{\\partial y_{m2}}{\\partial x} & \\cdots & \\frac{\\partial y_{mn}}{\\partial x}\\\\ \\end{bmatrix}.\n", "   \\end{aligned}\n" ] },
{ "cell_type": "markdown", "id": "89ac88bf", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "6. The derivative of a scalar function $y$ of a matrix $\\mathbf{X}$ of\n", "   independent variables, with respect to the matrix $\\mathbf{X}$, is given (in\n", "   numerator layout notation) by\n", "\n", "   \\begin{aligned}\n", "   \\frac{\\partial y}{\\partial \\mathbf{X}} = \\begin{bmatrix} \\frac{\\partial y}{\\partial x_{11}} & \\frac{\\partial y}{\\partial x_{21}} & \\cdots & \\frac{\\partial y}{\\partial x_{p1}}\\\\ \\frac{\\partial y}{\\partial x_{12}} & \\frac{\\partial y}{\\partial x_{22}} & \\cdots & \\frac{\\partial y}{\\partial x_{p2}}\\\\ \\vdots & \\vdots & \\ddots & \\vdots\\\\ \\frac{\\partial y}{\\partial x_{1q}} & \\frac{\\partial y}{\\partial x_{2q}} & \\cdots & \\frac{\\partial y}{\\partial x_{pq}}\\\\ \\end{bmatrix}.\n", "   \\end{aligned}" ] },
{ "cell_type": "markdown", "id": "add56f28", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "7. 
Many rules and tricks for matrix derivatives are available from\n", "\n", "   [Helmut Lütkepohl (1996), Handbook of Matrices, Chapter 10](../Literature/Matrix-Derivatives.pdf)" ] },
{ "cell_type": "markdown", "id": "06b8f779", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Newton’s algorithm for multidimensional optimization\n", "\n", "We can now generalise the update step in Newton’s method:\n", "$$\\mathbf{x}_{n+1}=\\mathbf{x}_n-\\left(\\frac{\\partial^2 f({\\mathbf x})}{\\partial {\\mathbf x}\\partial{\\mathbf x}'}\\right)^{-1}\n", " \\frac{\\partial f({\\mathbf x})}{\\partial {\\mathbf x}}\\Bigg|_{{\\mathbf x}={\\mathbf x}_n}$$" ] },
{ "cell_type": "markdown", "id": "103057f3", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Lab\n", "\n", "Write code to minimise $y=x_1^2-x_1x_2+x_2^2+e^{x_2}$. A sketch solution appears after the maximum likelihood slides below.\n" ] },
{ "cell_type": "markdown", "id": "723dc0c7", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The linear regression model, a revisit\n", "\n", "- Consider the linear regression model with multiple covariates,\n", "\n", "  $$y_i = \\beta_0 + \\beta_1x_{1i}+...+\\beta_p x_{pi} + \\epsilon_i$$ where\n", "  $\\epsilon_i \\sim N(0, \\sigma^2)$.\n", "\n", "- What is the (log) likelihood function?\n", "\n", "- What are the unknown parameters? Let’s consider three situations:\n", "\n", "  - When $\\beta_0=1$ and $\\sigma^2=1$ are known.\n", "\n", "  - When $\\sigma^2=1$ is known.\n", "\n", "  - Neither $\\mathbf{\\beta}$ nor $\\sigma^2$ is known.\n" ] },
{ "cell_type": "markdown", "id": "e6b48c26", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Maximum likelihood estimation for linear models\n", "\n", "- What are the gradient and Hessian matrix of the log likelihood\n", "  ($\\mathcal{L}$) with respect to the parameter vector\n", "  $\\mathbf{\\beta}=(\\beta_0,...,\\beta_p)'$?\n", "\n", "  $$\\frac{\\partial \\log \\mathcal{L}}{\\partial \\mathbf{\\beta}} = ?$$\n", "\n", "  $$\\frac{\\partial^2 \\log \\mathcal{L}}{\\partial \\mathbf{\\beta} \\partial \\mathbf{\\beta}'}\n", "  = ?$$\n", "\n" ] },
{ "cell_type": "markdown", "id": "c5870905", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Write down the likelihood function with respect to the unknown\n", "  parameters.\n", "\n", "- Write down the gradient of the log likelihood function.\n", "\n", "- Write down the Hessian of the log likelihood function.\n", "\n", "- Use Newton’s method to obtain the best parameter estimate."
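] },
{ "cell_type": "markdown", "id": "f1d0a003", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### A sketch for the lab\n", "\n", "A minimal sketch of the multidimensional update applied to the lab function $y=x_1^2-x_1x_2+x_2^2+e^{x_2}$, coding up the gradient and Hessian derived earlier. The starting point, tolerance and iteration cap are arbitrary choices.\n" ] },
{ "cell_type": "code", "execution_count": null, "id": "f1d0a004", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Newton's method for y = x1^2 - x1*x2 + x2^2 + exp(x2)\n", "grad <- function(x) c(2 * x[1] - x[2],\n", "                      -x[1] + 2 * x[2] + exp(x[2]))\n", "hess <- function(x) matrix(c(2, -1,\n", "                             -1, 2 + exp(x[2])), nrow = 2, byrow = TRUE)\n", "\n", "newton <- function(x0, tol = 1e-8, max_iter = 100) {\n", "  x <- x0\n", "  for (n in seq_len(max_iter)) {\n", "    g <- grad(x)\n", "    if (max(abs(g)) < tol) break  # stopping rule on the gradient\n", "    x <- x - solve(hess(x), g)    # x minus inverse Hessian times gradient\n", "  }\n", "  x\n", "}\n", "\n", "newton(c(1, 1))  # the minimiser of y"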
] }, { "cell_type": "markdown", "id": "663d2139", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Vector Newton-Raphson Algorithm: The logit model\n", "\n", "- **The idea**: using maximum likelihood method with binomial\n", " distribution.\n", "\n", "- One owns a house ($Y=1$) or do not own a house ($Y=0$) can be\n", " represented with **Bernoulli distribution**\n", " $$Pr(y;p) = p^y (1-p)^{1-y}\\!\\quad \\text{for }y\\in\\{0,1\\}.$$\n" ] }, { "cell_type": "markdown", "id": "3c89b368", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- The log likelihood function is as follows $$\\begin{split}\n", " l(\\beta)=& \\sum \\limits_{n=1}^N\\left\\{ y_i \\log P_i + (1- y_i)\n", " \\log (1-P_i) \\right\\}\\\\\n", " % =& \\sum \\limits_{n=1}^N \\left\\{y_i\\beta'x_i -(1-y_i)\\log (1+\\exp\\{1+\\beta'x_i\\}) \\right\\}\n", " \\end{split}$$ where\n", " $$P_i=\\frac{1}{1+\\exp(-(\\beta_1+\\beta_2X_{2i}+...+\\beta_pX_{pi}))}$$\n", "\n", "- Note that the sum of $n$ Bernoulli samples will be **binomial**\n", " distributed.\n", "\n", "- To obtain $\\hat \\beta$, use Newton-Raphson algorithm\n", " $$\\beta^{new} = \\beta^{old} - \\left(\\frac{\\partial^2l(\\beta)}{\\partial\n", " \\beta\\partial \\beta'} \\right)^{-1} \\frac{\\partial\n", " l(\\beta)}{\\partial\\beta} | _{\\beta= \\beta^{old}}$$\n" ] }, { "cell_type": "markdown", "id": "1ff3f1b4", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Let's code them up\n", "\n", "- Use Newton’s method to find the maximum likelihood estimators for the\n", " coefficients in a logistic regression. The steps are:\n", "\n", " - Write down likelihood function\n", "\n", " - Find the gradient and Hessian matrix\n", "\n", " - Code these up in R/Python\n", " \n", " - Using the following simulated data to test your code" ] }, { "cell_type": "code", "execution_count": 7, "id": "178d48ed", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "\n", "Call:\n", "glm(formula = y ~ x1 + x2, family = binomial(link = \"logit\"))\n", "\n", "Deviance Residuals: \n", " Min 1Q Median 3Q Max \n", "-3.00404 -0.25042 0.06014 0.34895 2.97130 \n", "\n", "Coefficients:\n", " Estimate Std. Error z value Pr(>|z|) \n", "(Intercept) 1.1877 0.1291 9.199 <2e-16 ***\n", "x1 2.1243 0.1768 12.018 <2e-16 ***\n", "x2 3.4635 0.2395 14.460 <2e-16 ***\n", "---\n", "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n", "\n", "(Dispersion parameter for binomial family taken to be 1)\n", "\n", " Null deviance: 1329.1 on 999 degrees of freedom\n", "Residual deviance: 515.4 on 997 degrees of freedom\n", "AIC: 521.4\n", "\n", "Number of Fisher Scoring iterations: 7\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Create some example data\n", "set.seed(123)\n", "n <- 1000\n", "x1 <- rnorm(n)\n", "x2 <- rnorm(n)\n", "y <- rbinom(n, 1, plogis(1 + 2*x1 + 3*x2))\n", "\n", "# Fit a logistic regression model\n", "model <- glm(y ~ x1 + x2, family = binomial(link = \"logit\"))\n", "summary(model)" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "4.3.2" } }, "nbformat": 4, "nbformat_minor": 5 }