{
"cells": [
{
"cell_type": "markdown",
"id": "dca1d1ad",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Stochastic Gradient Descent\n",
"\n",
"\n",
"Feng Li\n",
"\n",
"School of Statistics and Mathematics\n",
"\n",
"Central University of Finance and Economics\n",
"\n",
"[feng.li@cufe.edu.cn](mailto:feng.li@cufe.edu.cn)\n",
"\n",
"[https://feng.li/statcomp](https://feng.li/statcomp)\n"
]
},
{
"cell_type": "markdown",
"id": "c24ddeb1",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Quasi-Newton Methods\n",
"\n",
"### Quasi-Newton Methods\n",
"\n",
"- One of the most difficult parts of the Newton method is working out\n",
" the derivatives, especially the Hessian.\n",
"\n",
"- However methods can be used to approximate the Hessian and also the\n",
" gradient.\n",
"\n",
"- These are known as **Quasi-Newton Methods**.\n",
"\n",
"- In general they will converge slower than pure Newton methods.\n"
]
},
{
"cell_type": "markdown",
"id": "10f74714",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### The BFGS algorithm\n",
"\n",
"- The BFGS algorithm was introduced over several papers by Broyden,\n",
" Fletcher, Goldfarb and Shanno.\n",
"\n",
"- It is the most popular Quasi-Newton algorithm.\n",
"\n",
"- The R function ‘optim’ also has a variation called L-BFGS-B.\n",
"\n",
"- The L-BFGS-B uses less computer memory than BFGS and allows for **box\n",
" constraints**."
]
},
{
"cell_type": "markdown",
"id": "ff2ba8b1",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Box Constraints\n",
"\n",
"- Box constraints have the form\n",
"\n",
" $$l_i\\leq x_i \\leq u_i\\quad\\forall i$$\n",
"\n",
"- In statistics this can be very useful. Often parameters are\n",
" constrained\n",
"\n",
" - Variance must be greater than 0\n",
"\n",
" - For a stationary AR(1), coefficient must be between -1 and 1\n",
"\n",
" - Weights in a portfolio must be between 0 and 1 if short selling\n",
" is prohibited.\n"
]
},
{
"cell_type": "markdown",
"id": "a072d85a",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Gradient Descent\n",
"\n",
"- Gradient descent is a **first-order** iterative optimization\n",
" algorithm for finding the minimum of a function.\n",
"\n",
"- To find a **local minimum** of a function using **gradient descent**,\n",
" one takes steps proportional to the negative of the gradient (or\n",
" approximate gradient) of the function at the current point.\n",
"\n",
"- If, instead, one takes steps proportional to the positive of the\n",
" gradient, one approaches a **local maximum** of that function; the\n",
" procedure is then known as **gradient ascent**.\n",
"\n",
"- Gradient descent is based on the observation that if the\n",
" multi-variable function $F(\\mathbf {x} )$ is defined and\n",
" differentiable in $a$ neighborhood of a point $\\mathbf {a}$ , then\n",
" $F(\\mathbf {x} )$decreases **fastest** if one goes from\n",
" $\\mathbf {a}$ in the direction of the negative gradient of $F$ at\n",
" $\\mathbf {a}$ ,$-\\nabla F(\\mathbf {a} )$.\n"
]
},
{
"cell_type": "markdown",
"id": "6f9e930d",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- It follows that, if\n",
" $\\mathbf {a} _{n+1}=\\mathbf {a} _{n}-\\gamma \\nabla F(\\mathbf {a} _{n})$\n",
" for $\\gamma \\in \\mathbb {R} _{+}$ small enough, then\n",
" $F(\\mathbf {a_{n}} )\\geq F(\\mathbf {a_{n+1}} )$.\n",
"\n",
"- Gradient descent is relatively slow close to the **minimum**.\n",
"\n",
"- Conversely, using a fixed small $\\gamma$ can yield poor convergence.\n",
"\n",
"![GD](figures/Gradient_descent.png)\n"
]
},
{
"cell_type": "markdown",
"id": "3f70c627",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Stochastic Gradient Descent\n",
"\n",
"- **Stochastic gradient descent** (often abbreviated SGD) is an iterative\n",
" method for optimizing an objective function with suitable smoothness\n",
" properties (e.g. differentiable or subdifferentiable).\n",
"\n",
"- It is called stochastic because the method uses randomly selected\n",
" (or shuffled) samples to evaluate the gradients, hence SGD can be\n",
" regarded as a stochastic approximation of gradient descent\n",
" optimization."
]
},
{
"cell_type": "markdown",
"id": "4489afc6",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- When used to minimize the function $F(x)$, a standard (or\n",
" \"**batch**\") gradient descent method would perform the following\n",
" iterations :\n",
"\n",
" $$\\begin{aligned}\n",
" w:=w-\\eta \\nabla Q(w)=w-\\eta \\sum _{i=1}^{n}\\nabla Q_{i}(w)/n\n",
" \\end{aligned}$$ where $\\eta$ is a step size (sometimes called\n",
" the **learning rate** in machine learning).\n",
"\n",
"- **Batch**: a subset of a full dataset.\n",
"\n",
"- **Epoch**: Iteration over a full dataset with batches.\n",
"\n",
"- In many cases, the summand functions have a simple form that enables\n",
" inexpensive evaluations of the sum-function and the sum gradient.\n"
]
},
{
"cell_type": "markdown",
"id": "ca023534",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Stochastic gradient descent is a popular algorithm for training a\n",
" wide range of models in machine learning.\n",
"\n",
"- **Averaged stochastic gradient descent** is ordinary stochastic\n",
" gradient descent that records an average of its parameter vector\n",
" over time. That is, the update is the same as for ordinary\n",
" stochastic gradient descent, but the algorithm also keeps track of\n",
"\n",
" $$\\begin{aligned}\n",
" {\\bar {w}}={\\frac {1}{t}}\\sum _{i=0}^{t-1}w_{i}.\n",
" \\end{aligned}$$ When optimization is done, this averaged\n",
" parameter vector takes the place of $w$.\n",
"\n",
"- **Polyak-Ruppert averaging**. Stochastic gradient descent usually does not converge to a fixed value. Instead, it randomly walks with small variance after many epochs. A Polyak-Ruppert averaging is to use the mean of last $k$ steps in an iteration.\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "4981fce4",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Automatic Differencing \n",
"\n",
"- Analytical gradient is off great importance in optimization, but can be difficult to obtain.\n",
"\n",
"- Numerical gradient is not stable and slow in computation.\n",
"\n",
"- An analytical automatic gradient could be computed using second order accurate central differences in the interior points and either first or second order accurate one-sides (forward or backwards) differences at the boundaries.\n",
" \n",
"- Python module `autograd` or `jax` could be directly used for automatic gradient calculation."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "4.3.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}