{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "dca1d1ad",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Stochastic Gradient Descent\n",
    "\n",
    "\n",
    "Feng Li\n",
    "\n",
    "School of Statistics and Mathematics\n",
    "\n",
    "Central University of Finance and Economics\n",
    "\n",
    "[feng.li@cufe.edu.cn](mailto:feng.li@cufe.edu.cn)\n",
    "\n",
    "[https://feng.li/statcomp](https://feng.li/statcomp)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c24ddeb1",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Quasi-Newton Methods\n",
    "\n",
    "### Quasi-Newton Methods\n",
    "\n",
    "-   One of the most difficult parts of the Newton method is working out\n",
    "    the derivatives, especially the Hessian.\n",
    "\n",
    "-   However methods can be used to approximate the Hessian and also the\n",
    "    gradient.\n",
    "\n",
    "-   These are known as **Quasi-Newton Methods**.\n",
    "\n",
    "-   In general they will converge slower than pure Newton methods.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10f74714",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### The BFGS algorithm\n",
    "\n",
    "-   The BFGS algorithm was introduced over several papers by Broyden,\n",
    "    Fletcher, Goldfarb and Shanno.\n",
    "\n",
    "-   It is the most popular Quasi-Newton algorithm.\n",
    "\n",
    "-   The R function ‘optim’ also has a variation called L-BFGS-B.\n",
    "\n",
    "-   The L-BFGS-B uses less computer memory than BFGS and allows for **box\n",
    "    constraints**."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff2ba8b1",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Box Constraints\n",
    "\n",
    "-   Box constraints have the form\n",
    "\n",
    "    $$l_i\\leq x_i \\leq u_i\\quad\\forall i$$\n",
    "\n",
    "-   In statistics this can be very useful. Often parameters are\n",
    "    constrained\n",
    "\n",
    "    -   Variance must be greater than 0\n",
    "\n",
    "    -   For a stationary AR(1), coefficient must be between -1 and 1\n",
    "\n",
    "    -   Weights in a portfolio must be between 0 and 1 if short selling\n",
    "        is prohibited.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a072d85a",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Gradient Descent\n",
    "\n",
    "-   Gradient descent is a **first-order** iterative optimization\n",
    "    algorithm for finding the minimum of a function.\n",
    "\n",
    "-   To find a **local minimum** of a function using **gradient descent**,\n",
    "    one takes steps proportional to the negative of the gradient (or\n",
    "    approximate gradient) of the function at the current point.\n",
    "\n",
    "-   If, instead, one takes steps proportional to the positive of the\n",
    "    gradient, one approaches a **local maximum** of that function; the\n",
    "    procedure is then known as **gradient ascent**.\n",
    "\n",
    "-   Gradient descent is based on the observation that if the\n",
    "    multi-variable function $F(\\mathbf {x} )$ is defined and\n",
    "    differentiable in $a$ neighborhood of a point $\\mathbf {a}$ , then\n",
    "    $F(\\mathbf {x} )$decreases **fastest** if one goes from\n",
    "    $\\mathbf {a}$ in the direction of the negative gradient of $F$ at\n",
    "    $\\mathbf {a}$ ,$-\\nabla F(\\mathbf {a} )$.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f9e930d",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "-   It follows that, if\n",
    "    $\\mathbf {a} _{n+1}=\\mathbf {a} _{n}-\\gamma \\nabla F(\\mathbf {a} _{n})$\n",
    "    for $\\gamma \\in \\mathbb {R} _{+}$ small enough, then\n",
    "    $F(\\mathbf {a_{n}} )\\geq F(\\mathbf {a_{n+1}} )$.\n",
    "\n",
    "-   Gradient descent is relatively slow close to the **minimum**.\n",
    "\n",
    "-   Conversely, using a fixed small $\\gamma$ can yield poor convergence.\n",
    "\n",
    "![GD](figures/Gradient_descent.png)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f70c627",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Stochastic Gradient Descent\n",
    "\n",
    "-   **Stochastic gradient descent** (often abbreviated SGD) is an iterative\n",
    "    method for optimizing an objective function with suitable smoothness\n",
    "    properties (e.g. differentiable or subdifferentiable).\n",
    "\n",
    "-   It is called stochastic because the method uses randomly selected\n",
    "    (or shuffled) samples to evaluate the gradients, hence SGD can be\n",
    "    regarded as a stochastic approximation of gradient descent\n",
    "    optimization."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4489afc6",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "-   When used to minimize the function $F(x)$, a standard (or\n",
    "    \"**batch**\") gradient descent method would perform the following\n",
    "    iterations :\n",
    "\n",
    "    $$\\begin{aligned}\n",
    "        w:=w-\\eta \\nabla Q(w)=w-\\eta \\sum _{i=1}^{n}\\nabla Q_{i}(w)/n\n",
    "        \\end{aligned}$$ where $\\eta$ is a step size (sometimes called\n",
    "    the **learning rate** in machine learning).\n",
    "\n",
    "- **Batch**: a subset of a full dataset.\n",
    "\n",
    "- **Epoch**: Iteration over a full dataset with batches.\n",
    "\n",
    "-   In many cases, the summand functions have a simple form that enables\n",
    "    inexpensive evaluations of the sum-function and the sum gradient.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca023534",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "-   Stochastic gradient descent is a popular algorithm for training a\n",
    "    wide range of models in machine learning.\n",
    "\n",
    "-   **Averaged stochastic gradient descent** is ordinary stochastic\n",
    "    gradient descent that records an average of its parameter vector\n",
    "    over time. That is, the update is the same as for ordinary\n",
    "    stochastic gradient descent, but the algorithm also keeps track of\n",
    "\n",
    "    $$\\begin{aligned}\n",
    "        {\\bar {w}}={\\frac {1}{t}}\\sum _{i=0}^{t-1}w_{i}.\n",
    "        \\end{aligned}$$ When optimization is done, this averaged\n",
    "    parameter vector takes the place of $w$.\n",
    "\n",
    "- **Polyak-Ruppert averaging**. Stochastic gradient descent usually does not converge to a fixed value. Instead, it randomly walks with small variance after many epochs. A Polyak-Ruppert averaging is to use the mean of  last $k$ steps in an iteration.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4981fce4",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Automatic Differencing \n",
    "\n",
    "- Analytical gradient is off great importance in optimization, but can be difficult to obtain.\n",
    "\n",
    "- Numerical gradient is not stable and slow in computation.\n",
    "\n",
    "- An analytical automatic gradient could be computed using second order accurate central differences in the interior points and either first or second order accurate one-sides (forward or backwards) differences at the boundaries.\n",
    "    \n",
    "- Python module `autograd` or `jax` could be directly used for automatic gradient calculation."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "4.3.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}