{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Statistical Modeling with MapReduce\n",
"\n",
"## Feng Li\n",
"\n",
"### Central University of Finance and Economics\n",
"\n",
"### [feng.li@cufe.edu.cn](feng.li@cufe.edu.cn)\n",
"### Course home page: [https://feng.li/distcomp](https://feng.li/distcomp)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## What task is suitable for MapReduce?\n",
"\n",
"\n",
"- Hadoop programs are primarily about processing data.\n",
"\n",
"- **Sorting**: MapReduce implements the sorting algorithm to sort the output key-value pairs from Mapper by their keys. \n",
"\n",
"- **Searching**: Mapper passes the pattern to search as a distinctive character.\n",
"\n",
"- **Indexing**: Allocated the position of a pattern\n",
"\n",
"- **NLP**: TFIDF, Word2Vec, doc2vec...\n",
"\n",
"- and other independent data processing"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## MapReduce - Fault Tolerance\n",
"\n",
"- When dealing with large data sets, it is inevitable that some records will have errors.\n",
"\n",
"- While you should make your program as robust as possible to malformed records, you should also have a recovery mechanism to handle the cases you couldn’t plan for. You don’t want your whole job to fail only because it fails to handle one bad record.\n",
"\n",
"- Hadoop MapReduce provides a feature for skipping over records that it believes to be crashing a task.\n",
"\n",
" - A task will enter into skipping mode after the task has been retried several times.\n",
" - The TaskTracker will track and determine which record range is causing failure. The TaskTracker will then restart the task but skip over the bad record range.\n",
" - You could adjust it with the `JobConf` option together with `mapreduce.map.skip.maxrecords` and `mapreduce.reduce.skip.maxrecords`. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Linear Regression with MapReduce\n",
"\n",
"\n",
"- Assume we have a large dataset. How will we perform regression data analysis now?\n",
"\n",
"- Hadoop MapReduce for linear regression is possible by implementing Mapper and Reducer.\n",
"\n",
"- It will divide the dataset into chunks among the available nodes and then they will process the distributed data in parallel.\n",
"\n",
"- It will not fire memory issues when we run with an R and Hadoop cluster because the large dataset is going to be distributed and processed with R among Hadoop computation nodes.\n",
"\n",
"- Also, keep in mind that this implemented method does not provide higher prediction accuracy than the `lm()` model."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Linear Regression\n",
"\n",
"- Assume we have data set contains both $y_{n\\times 1}$ and $X_{n\\times p}$. \n",
"\n",
"- The linear model $y=X\\beta +\\epsilon$ yields the following solution to $\\widehat \\beta$ \n",
"\n",
" $ \\hat\\beta = (X'X)^{-1}X'y $\n",
" \n",
" \n",
"- The Big Data problem: $n>>p$\n",
"\n",
" - The calculations of $X'X$ and $X'y$ is very computational demanding.\n",
" - But notice that the final output of $(X'X)_{p\\times p}$ and $(X'y)_{p\\times 1}$ are fairly small.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Linear Regression: Knocking on wood\n",
"\n",
"- Let's start with a simple case:\n",
"\n",
"\\begin{equation}\n",
" X'y = \\begin{bmatrix}\n",
" x'_{1.},\n",
" x'_{2.},\n",
" ...,\n",
" x'_{n.}\n",
" \\end{bmatrix}\n",
"y = \\sum_{i=1}^n x_{i.}'y_i\n",
" \\end{equation}\n",
"\n",
"- Then you have\n",
"\n",
"\\begin{equation}\n",
" X'X = \\begin{bmatrix}\n",
" x_{1.},\n",
" x_{2.},\n",
" ...,\n",
" x_{n.}\\end{bmatrix} \\times \n",
" \\begin{bmatrix}\n",
" x'_{1.}\\\\\n",
" x'_{2.}\\\\\n",
" ...\\\\\n",
" x'_{n.}\n",
" \\end{bmatrix}\n",
" = \\sum_{i=1}^n x_{i.} x'_{i.}\n",
" \\end{equation}\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Example code"
]
},
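{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"A minimal sketch of the two sums $X'X$ and $X'y$ in Python with NumPy (an assumption; the slides do not fix a language here). The names `mapper` and `reducer` are hypothetical, the data are simulated, and the chunks are plain array slices rather than real HDFS splits.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def mapper(X_chunk, y_chunk):\n",
"    # Each node sums x_i' x_i and x_i' y_i over its own rows only.\n",
"    return X_chunk.T @ X_chunk, X_chunk.T @ y_chunk\n",
"\n",
"def reducer(partials):\n",
"    # Add up the small p x p and p x 1 pieces from all chunks.\n",
"    partials = list(partials)\n",
"    xtx = sum(part[0] for part in partials)\n",
"    xty = sum(part[1] for part in partials)\n",
"    # Solve (X'X) beta = X'y instead of inverting X'X explicitly.\n",
"    return np.linalg.solve(xtx, xty)\n",
"\n",
"# Simulated data (hypothetical sizes and coefficients)\n",
"rng = np.random.default_rng(0)\n",
"n, p = 10_000, 3\n",
"X = rng.normal(size=(n, p))\n",
"beta_true = np.array([1.0, -2.0, 0.5])\n",
"y = X @ beta_true + rng.normal(scale=0.1, size=n)\n",
"\n",
"# Split the rows into 4 chunks, as if they were stored on 4 nodes\n",
"chunks = zip(np.array_split(X, 4), np.array_split(y, 4))\n",
"beta_hat = reducer(mapper(Xc, yc) for Xc, yc in chunks)\n",
"\n",
"# Single-machine least-squares fit, for comparison\n",
"beta_full = np.linalg.lstsq(X, y, rcond=None)[0]\n",
"```\n",
"\n",
"Comparing `beta_hat` with `beta_full` shows that the chunked sums reproduce the ordinary least-squares estimate up to floating-point error. In a real Hadoop Streaming job the mapper and reducer would read and write text records instead of arrays."
]
},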
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Logistic Regression\n",
"\n",
"- In statistics, logistic regression or logit regression is a type of probabilistic classification model.\n",
"\n",
"- Logistic regression is used extensively in numerous disciplines, including the medical and social science fields. It can be binomial or multinomial.\n",
"\n",
"- Binary logistic regression deals with situations in which the outcome for a dependent variable can have two possible types.\n",
"\n",
"- Multinomial logistic regression deals with situations where the outcome can have three or more possible types.\n",
"\n",
"- Logistic regression can be implemented using logistic functions.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Logistic Regression: The Model\n",
"\n",
"- The logit model connects the explanatory variables in this way\n",
"\n",
" $P_i=\\frac{1}{1+\\exp(-(\\beta_1+\\beta_2X_i))}$\n",
"\n",
"- Alternatively we can write the model in this way\n",
"\n",
" $\\log \\frac{P_i}{1-P_i} = \\beta_1+\\beta_2X_i$\n",
" \n",
" where $P_i/(1-P_i)$ is called the **odds ratio**: the ratio of probability\n",
" of a family will own a house to the probability of not owing a house.\n",
"\n",
"- This model can be easily estimated with the `glm()` function in R or `sklearn.linear_model.LogisticRegression()` in Python.\n",
"\n",
"- The logistic regression is different from linear regressions as it does not have analytical solutions"
]
},
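{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"To illustrate why no closed form is available, here is a minimal sketch of one common numerical approach, Newton's method, for the model $P_i = 1/(1+\\exp(-(\\beta_1+\\beta_2X_i)))$. The function names, the simulated data and the coefficient values are assumptions made for illustration; `glm()` and scikit-learn use their own solvers.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def logistic(z):\n",
"    return 1.0 / (1.0 + np.exp(-z))\n",
"\n",
"def fit_logit_newton(X, y, n_iter=25, tol=1e-10):\n",
"    # Maximise the log-likelihood by repeated Newton updates.\n",
"    beta = np.zeros(X.shape[1])\n",
"    for _ in range(n_iter):\n",
"        p = logistic(X @ beta)\n",
"        gradient = X.T @ (y - p)                       # score vector\n",
"        hessian = X.T @ (X * (p * (1 - p))[:, None])   # X' W X\n",
"        step = np.linalg.solve(hessian, gradient)\n",
"        beta = beta + step\n",
"        if np.max(np.abs(step)) < tol:\n",
"            break\n",
"    return beta\n",
"\n",
"# Simulated data with an intercept and one regressor (hypothetical values)\n",
"rng = np.random.default_rng(1)\n",
"n = 5_000\n",
"x = rng.normal(size=n)\n",
"X = np.column_stack([np.ones(n), x])\n",
"beta_true = np.array([-0.5, 1.5])\n",
"y = (rng.random(n) < logistic(X @ beta_true)).astype(float)\n",
"\n",
"beta_hat = fit_logit_newton(X, y)\n",
"```\n",
"\n",
"Each update needs the current value of `beta`, so the iterations have to run one after another, unlike the single pass that suffices for linear regression."
]
},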
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Logistic Regression with MapReduce\n",
"\n",
"\n",
"- **Bad news**: The above estimation requires sequential iterative method.\n",
"\n",
"- Will the following hypothetical Hadoop workflow work?\n",
"\n",
" \n",
" Defining the Mapper function\n",
" Defining the Reducer function\n",
" Defining the Logistic Regression MapReduce function\n",
" \n",
"\n",
"- Logistic regression is the standard industry workhorse that\n",
" underlies many production fraud detection and advertising quality\n",
" and targeting products. The most common implementations use\n",
" Stochastic Gradient Descent (SGD) to all large training sets to be\n",
" used. The good news is that it is blazingly fast and thus it is\n",
" not a problem for Hadoop implementation to handle training sets of\n",
" tens of millions of examples. With the down-sampling typical in\n",
" many data-sets, this is equivalent to a dataset with billions of\n",
" raw training examples. The ready to use solutions:\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Logistic Regression: the divide and conquer approach\n",
"\n",
"\n",
"- Divide $n$ sample into $k$ blocks that each block consists of $m$ observations。\n",
"\n",
"- Do logistic regression with each block on a single node.\n",
" \n",
" $\\widehat \\beta_l = \\arg~\\max \\sum _{i =1}^m \\{ y_{li}x_{li}'\\beta-\\log(1+\\exp\\{x_i'\\beta\\})\\}.$\n",
"\n",
"- The Full Logistic Regression model with coefficients\n",
" \n",
" - $\\widehat \\beta$ can be approximated by weighted average of $\\widehat \\beta_l$\n",
" \n",
" $\n",
" \\widehat \\beta = \\frac{1}{k}\\sum_{l=1}^k \\widehat \\beta_l.\n",
" $\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Example code"
]
},
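{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"A minimal sketch of the divide and conquer estimator $\\widehat\\beta = \\frac{1}{k}\\sum_{l=1}^k \\widehat\\beta_l$ in Python with NumPy. The per-block solver `fit_logit` (plain Newton updates) is an assumption; any logistic regression fitter could take its place. The data, the block count and the coefficient values are simulated for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def fit_logit(X, y, n_iter=25):\n",
"    # Newton updates for one block (assumed solver, see the note above).\n",
"    beta = np.zeros(X.shape[1])\n",
"    for _ in range(n_iter):\n",
"        p = 1.0 / (1.0 + np.exp(-(X @ beta)))\n",
"        hessian = X.T @ (X * (p * (1 - p))[:, None])\n",
"        beta = beta + np.linalg.solve(hessian, X.T @ (y - p))\n",
"    return beta\n",
"\n",
"# Simulated data: n observations split into k blocks of m = n / k rows\n",
"rng = np.random.default_rng(2)\n",
"n, k = 20_000, 4\n",
"X = np.column_stack([np.ones(n), rng.normal(size=n)])\n",
"beta_true = np.array([-0.5, 1.5])\n",
"y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)\n",
"\n",
"# Map step: one estimate per block, each computable on a single node\n",
"blocks = zip(np.array_split(X, k), np.array_split(y, k))\n",
"block_estimates = [fit_logit(Xb, yb) for Xb, yb in blocks]\n",
"\n",
"# Reduce step: simple average of the k block estimates\n",
"beta_dc = np.mean(block_estimates, axis=0)\n",
"\n",
"# Full-data fit on one machine, for comparison\n",
"beta_full = fit_logit(X, y)\n",
"```\n",
"\n",
"Here `beta_dc` is only an approximation to `beta_full`; the two are close when each block is large, and the gap can grow as the blocks get smaller."
]
},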
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## So far so good?"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Bash",
"language": "bash",
"name": "bash"
},
"language_info": {
"codemirror_mode": "shell",
"file_extension": ".sh",
"mimetype": "text/x-sh",
"name": "bash"
}
},
"nbformat": 4,
"nbformat_minor": 4
}