{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Statistical Modeling with MapReduce\n", "\n", "## Feng Li\n", "\n", "### Central University of Finance and Economics\n", "\n", "### [feng.li@cufe.edu.cn](feng.li@cufe.edu.cn)\n", "### Course home page: [https://feng.li/distcomp](https://feng.li/distcomp)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## What task is suitable for MapReduce?\n", "\n", "\n", "- Hadoop programs are primarily about processing data.\n", "\n", "- **Sorting**: MapReduce implements the sorting algorithm to sort the output key-value pairs from Mapper by their keys. \n", "\n", "- **Searching**: Mapper passes the pattern to search as a distinctive character.\n", "\n", "- **Indexing**: Allocated the position of a pattern\n", "\n", "- **NLP**: TFIDF, Word2Vec, doc2vec...\n", "\n", "- and other independent data processing" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## MapReduce - Fault Tolerance\n", "\n", "- When dealing with large data sets, it is inevitable that some records will have errors.\n", "\n", "- While you should make your program as robust as possible to malformed records, you should also have a recovery mechanism to handle the cases you couldn’t plan for. You don’t want your whole job to fail only because it fails to handle one bad record.\n", "\n", "- Hadoop MapReduce provides a feature for skipping over records that it believes to be crashing a task.\n", "\n", " - A task will enter into skipping mode after the task has been retried several times.\n", " - The TaskTracker will track and determine which record range is causing failure. The TaskTracker will then restart the task but skip over the bad record range.\n", " - You could adjust it with the `JobConf` option together with `mapreduce.map.skip.maxrecords` and `mapreduce.reduce.skip.maxrecords`. 
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Linear Regression with MapReduce\n", "\n", "\n", "- Assume we have a large dataset. How will we perform regression data analysis now?\n", "\n", "- Hadoop MapReduce for linear regression is possible by implementing Mapper and Reducer.\n", "\n", "- It will divide the dataset into chunks among the available nodes and then they will process the distributed data in parallel.\n", "\n", "- It will not fire memory issues when we run with an R and Hadoop cluster because the large dataset is going to be distributed and processed with R among Hadoop computation nodes.\n", "\n", "- Also, keep in mind that this implemented method does not provide higher prediction accuracy than the `lm()` model." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Linear Regression\n", "\n", "- Assume we have data set contains both $y_{n\\times 1}$ and $X_{n\\times p}$. 
\n", "\n", "- The linear model $y=X\\beta +\\epsilon$ yields the following solution to $\\widehat \\beta$ \n", "\n", " $ \\hat\\beta = (X'X)^{-1}X'y $\n", " \n", " \n", "- The Big Data problem: $n>>p$\n", "\n", " - The calculations of $X'X$ and $X'y$ is very computational demanding.\n", " - But notice that the final output of $(X'X)_{p\\times p}$ and $(X'y)_{p\\times 1}$ are fairly small.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Linear Regression: Knocking on wood\n", "\n", "- Let's start with a simple case:\n", "\n", "\\begin{equation}\n", " X'y = \\begin{bmatrix}\n", " x'_{1.},\n", " x'_{2.},\n", " ...,\n", " x'_{n.}\n", " \\end{bmatrix}\n", "y = \\sum_{i=1}^n x_{i.}'y_i\n", " \\end{equation}\n", "\n", "- Then you have\n", "\n", "\\begin{equation}\n", " X'X = \\begin{bmatrix}\n", " x_{1.},\n", " x_{2.},\n", " ...,\n", " x_{n.}\\end{bmatrix} \\times \n", " \\begin{bmatrix}\n", " x'_{1.}\\\\\n", " x'_{2.}\\\\\n", " ...\\\\\n", " x'_{n.}\n", " \\end{bmatrix}\n", " = \\sum_{i=1}^n x_{i.} x'_{i.}\n", " \\end{equation}\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Example code" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Logistic Regression\n", "\n", "- In statistics, logistic regression or logit regression is a type of probabilistic classification model.\n", "\n", "- Logistic regression is used extensively in numerous disciplines, including the medical and social science fields. 
It can be binomial or multinomial.\n", "\n", "- Binary logistic regression deals with situations in which the outcome for a dependent variable can have two possible types.\n", "\n", "- Multinomial logistic regression deals with situations where the outcome can have three or more possible types.\n", "\n", "- Logistic regression models the outcome probability with the logistic (sigmoid) function.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Logistic Regression: The Model\n", "\n", "- The logit model connects the explanatory variables to the outcome probability in this way\n", "\n", " $P_i=\\frac{1}{1+\\exp(-(\\beta_1+\\beta_2X_i))}$\n", "\n", "- Alternatively we can write the model in this way\n", "\n", " $\\log \\frac{P_i}{1-P_i} = \\beta_1+\\beta_2X_i$\n", " \n", " where $P_i/(1-P_i)$ is called the **odds ratio**: the ratio of the probability\n", " that a family owns a house to the probability that it does not.\n", "\n", "- This model can be easily estimated with the `glm()` function in R or `sklearn.linear_model.LogisticRegression()` in Python.\n", "\n", "- Logistic regression differs from linear regression in that it has no analytical solution; the coefficients must be estimated iteratively." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Logistic Regression with MapReduce\n", "\n", "\n", "- **Bad news**: the estimation above requires a sequential, iterative method.\n", "\n", "- Will the following hypothetical Hadoop workflow work?\n", "\n", " \n", " 1. Define the Mapper function\n", " 2. Define the Reducer function\n", " 3. Define the logistic regression MapReduce function\n", " \n", "\n", "- Logistic regression is the standard industry workhorse that\n", " underlies many production fraud detection and advertising quality\n", " and targeting products. The most common implementations use\n", " Stochastic Gradient Descent (SGD) to allow large training sets to be\n", " used. 
The good news is that it is blazingly fast and thus it is\n", " not a problem for a Hadoop implementation to handle training sets of\n", " tens of millions of examples. With the down-sampling typical in\n", " many data sets, this is equivalent to a dataset with billions of\n", " raw training examples. Ready-to-use solutions exist.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Logistic Regression: the divide and conquer approach\n", "\n", "\n", "- Divide the $n$ observations into $k$ blocks, where each block consists of $m$ observations.\n", "\n", "- Run a logistic regression on each block on a single node:\n", " \n", " $\\widehat \\beta_l = \\arg\\max_{\\beta} \\sum _{i=1}^m \\{ y_{li}x_{li}'\\beta-\\log(1+\\exp\\{x_{li}'\\beta\\})\\}.$\n", "\n", "- The full-model coefficients $\\widehat \\beta$ can then be approximated by the simple average of the block estimates $\\widehat \\beta_l$:\n", " \n", " $\n", " \\widehat \\beta = \\frac{1}{k}\\sum_{l=1}^k \\widehat \\beta_l.\n", " $\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Example code" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## So far so good?" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Bash", "language": "bash", "name": "bash" }, "language_info": { "codemirror_mode": "shell", "file_extension": ".sh", "mimetype": "text/x-sh", "name": "bash" } }, "nbformat": 4, "nbformat_minor": 4 }