{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Statistical Modeling with MapReduce\n", "\n", "## Feng Li\n", "\n", "### Central University of Finance and Economics\n", "\n", "### [feng.li@cufe.edu.cn](feng.li@cufe.edu.cn)\n", "### Course home page: [https://feng.li/distcomp](https://feng.li/distcomp)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## What task is suitable for MapReduce?\n", "\n", "\n", "- Hadoop programs are primarily about processing data.\n", "\n", "- **Sorting**: MapReduce implements the sorting algorithm to sort the output key-value pairs from Mapper by their keys. \n", "\n", "- **Searching**: Mapper passes the pattern to search as a distinctive character.\n", "\n", "- **Indexing**: Allocated the position of a pattern\n", "\n", "- **NLP**: TFIDF, Word2Vec, doc2vec...\n", "\n", "- and other independent data processing" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## MapReduce - Fault Tolerance\n", "\n", "- When dealing with large data sets, it is inevitable that some records will have errors.\n", "\n", "- While you should make your program as robust as possible to malformed records, you should also have a recovery mechanism to handle the cases you couldn’t plan for. You don’t want your whole job to fail only because it fails to handle one bad record.\n", "\n", "- Hadoop MapReduce provides a feature for skipping over records that it believes to be crashing a task.\n", "\n", " - A task will enter into skipping mode after the task has been retried several times.\n", " - The TaskTracker will track and determine which record range is causing failure. The TaskTracker will then restart the task but skip over the bad record range.\n", " - You could adjust it with the `JobConf` option together with `mapreduce.map.skip.maxrecords` and `mapreduce.reduce.skip.maxrecords`. 
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Linear Regression with MapReduce\n", "\n", "\n", "- Assume we have a large dataset. How will we perform regression data analysis now?\n", "\n", "- Hadoop MapReduce for linear regression is possible by implementing Mapper and Reducer.\n", "\n", "- It will divide the dataset into chunks among the available nodes and then they will process the distributed data in parallel.\n", "\n", "- It will not fire memory issues when we run with an R and Hadoop cluster because the large dataset is going to be distributed and processed with R among Hadoop computation nodes.\n", "\n", "- Also, keep in mind that this implemented method does not provide higher prediction accuracy than the `lm()` model." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Linear Regression\n", "\n", "- Assume we have data set contains both $y_{n\\times 1}$ and $X_{n\\times p}$. 
\n", "\n", "- The linear model $y=X\\beta +\\epsilon$ yields the following solution to $\\widehat \\beta$ \n", "\n", " $ \\hat\\beta = (X'X)^{-1}X'y $\n", " \n", " \n", "- The Big Data problem: $n>>p$\n", "\n", " - The calculations of $X'X$ and $X'y$ is very computational demanding.\n", " - But notice that the final output of $(X'X)_{p\\times p}$ and $(X'y)_{p\\times 1}$ are fairly small.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Linear Regression: Knocking on wood\n", "\n", "- Let's start with a simple case:\n", "\n", "\\begin{equation}\n", " X'y = \\begin{bmatrix}\n", " x'_{1.},\n", " x'_{2.},\n", " ...,\n", " x'_{n.}\n", " \\end{bmatrix}\n", "y = \\sum_{i=1}^n x_{i.}'y_i\n", " \\end{equation}\n", "\n", "- Then you have\n", "\n", "\\begin{equation}\n", " X'X = \\begin{bmatrix}\n", " x_{1.},\n", " x_{2.},\n", " ...,\n", " x_{n.}\\end{bmatrix} \\times \n", " \\begin{bmatrix}\n", " x'_{1.}\\\\\n", " x'_{2.}\\\\\n", " ...\\\\\n", " x'_{n.}\n", " \\end{bmatrix}\n", " = \\sum_{i=1}^n x_{i.} x'_{i.}\n", " \\end{equation}\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Example code" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Logistic Regression\n", "\n", "- In statistics, logistic regression or logit regression is a type of probabilistic classification model.\n", "\n", "- Logistic regression is used extensively in numerous disciplines, including the medical and social science fields. 
It can be binomial or multinomial.\n", "\n", "- Binary logistic regression deals with situations in which the outcome for a dependent variable can have two possible types.\n", "\n", "- Multinomial logistic regression deals with situations where the outcome can have three or more possible types.\n", "\n", "- Logistic regression models the outcome probability with the logistic (sigmoid) function.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Logistic Regression: The Model\n", "\n", "- The logit model connects the explanatory variables to the outcome probability in this way\n", "\n", " $P_i=\\frac{1}{1+\\exp(-(\\beta_1+\\beta_2X_i))}$\n", "\n", "- Alternatively we can write the model in this way\n", "\n", " $\\log \\frac{P_i}{1-P_i} = \\beta_1+\\beta_2X_i$\n", " \n", " where $P_i/(1-P_i)$ is called the **odds ratio**: the ratio of the probability\n", " that a family owns a house to the probability that it does not.\n", "\n", "- This model can be easily estimated with the `glm()` function in R or `sklearn.linear_model.LogisticRegression()` in Python.\n", "\n", "- Logistic regression differs from linear regression in that it has no analytical solution; the coefficients must be estimated iteratively." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Logistic Regression with MapReduce\n", "\n", "\n", "- **Bad news**: the estimation above requires a sequential, iterative method.\n", "\n", "- Will the following hypothetical Hadoop workflow work?\n", "\n", " \n", " 1. Define the Mapper function\n", " 2. Define the Reducer function\n", " 3. Define the logistic regression MapReduce function\n", " \n", "\n", "- Logistic regression is the standard industry workhorse that\n", " underlies many production fraud detection and advertising quality\n", " and targeting products. The most common implementations use\n", " Stochastic Gradient Descent (SGD) to allow large training sets to be\n", " used. 
The good news is that it is blazingly fast and thus it is\n", " not a problem for a Hadoop implementation to handle training sets of\n", " tens of millions of examples. With the down-sampling typical in\n", " many data sets, this is equivalent to a dataset with billions of\n", " raw training examples. Ready-to-use solutions exist.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Logistic Regression: the divide and conquer approach\n", "\n", "\n", "- Divide the $n$ observations into $k$ blocks, where each block consists of $m$ observations.\n", "\n", "- Run a logistic regression on each block on a single node:\n", " \n", " $\\widehat \\beta_l = \\arg\\max_{\\beta} \\sum _{i=1}^m \\{ y_{li}x_{li}'\\beta-\\log(1+\\exp\\{x_{li}'\\beta\\})\\}.$\n", "\n", "- The full-model coefficients $\\widehat \\beta$ can then be approximated by the simple average of the block estimates $\\widehat \\beta_l$:\n", " \n", " $\n", " \\widehat \\beta = \\frac{1}{k}\\sum_{l=1}^k \\widehat \\beta_l.\n", " $\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Example code" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## So far so good?" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Bash", "language": "bash", "name": "bash" }, "language_info": { "codemirror_mode": "shell", "file_extension": ".sh", "mimetype": "text/x-sh", "name": "bash" } }, "nbformat": 4, "nbformat_minor": 4 }