Statistical Modeling with MapReduce¶

Feng Li¶

Central University of Finance and Economics¶

feng.li@cufe.edu.cn¶

Course home page: https://feng.li/distcomp¶

What task is suitable for MapReduce?¶

  • Hadoop programs are primarily about processing data.

  • Sorting: MapReduce implements sorting by automatically ordering the key-value pairs emitted by the Mapper by their keys before they reach the Reducer.

  • Searching: the pattern to search for is passed to each Mapper, which emits only the matching records (see the sketch after this list).

  • Indexing: locating the positions at which a pattern occurs, e.g. building an inverted index from each word to the documents that contain it.

  • NLP: TF-IDF, Word2Vec, Doc2Vec, ...

  • and other embarrassingly parallel data processing tasks.
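To make the searching task concrete, here is a minimal Hadoop-Streaming-style "distributed grep" mapper in Python. It is a sketch, not production code: the SEARCH_PATTERN environment variable is an illustrative name (Hadoop Streaming can set such variables with its -cmdenv option), and the Reducer can simply be the identity.

    #!/usr/bin/env python3
    # mapper.py -- a "distributed grep": emit only the records that
    # contain a pattern supplied through the job configuration.
    import os
    import sys

    # Illustrative variable name; with Hadoop Streaming it could be
    # set via: -cmdenv SEARCH_PATTERN=error
    PATTERN = os.environ.get("SEARCH_PATTERN", "error")

    for line in sys.stdin:
        if PATTERN in line:
            # Emit key-value pairs; the shuffle phase sorts them by key.
            print(PATTERN + "\t" + line.rstrip("\n"))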

MapReduce - Fault Tolerance¶

  • When dealing with large data sets, it is inevitable that some records will have errors.

  • While you should make your program as robust as possible to malformed records (see the defensive mapper sketch after this list), you should also have a recovery mechanism for the cases you couldn't plan for. You don't want a whole job to fail only because of one bad record.

  • Hadoop MapReduce provides a feature for skipping over records that it believes are crashing a task.

    • A task enters skipping mode after it has been retried several times.
    • The TaskTracker tracks which record range is causing the failures; it then restarts the task and skips over that bad record range.
    • You can tune the size of the skipped range through job configuration (JobConf) properties such as mapreduce.map.skip.maxrecords and mapreduce.reduce.skip.maxgroups.
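Skipping mode is a last resort; much malformed input can be handled inside the program itself. Below is a minimal defensive-mapper sketch in Python, assuming tab-separated records with a numeric value field; the reporter:counter line uses Hadoop Streaming's counter syntax to report how many records were dropped.

    #!/usr/bin/env python3
    # mapper.py -- a defensive streaming mapper: malformed records are
    # counted and skipped rather than allowed to crash the whole task.
    import sys

    for line in sys.stdin:
        try:
            key, value = line.rstrip("\n").split("\t")
            print(key + "\t" + str(float(value)))  # raises on bad input
        except ValueError:
            # Hadoop Streaming counter syntax: group,counter,increment.
            sys.stderr.write("reporter:counter:Quality,MalformedRecords,1\n")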

Linear Regression with MapReduce¶

  • Assume we have a very large dataset. How do we perform regression analysis on it now?

  • Linear regression on Hadoop MapReduce is possible by implementing a suitable Mapper and Reducer.

  • The dataset is divided into chunks across the available nodes, which then process the distributed data in parallel.

  • Memory is not an issue when we run this with an R and Hadoop cluster, because the large dataset is distributed across, and processed by, the Hadoop computation nodes running R.

  • Also keep in mind that this distributed method does not provide higher prediction accuracy than the single-machine lm() model; it only makes the computation feasible at scale.

Linear Regression¶

  • Assume we have a dataset containing both $y_{n\times 1}$ and $X_{n\times p}$.

  • The linear model $y=X\beta +\epsilon$ yields the following least-squares solution for $\widehat \beta$

    $ \widehat\beta = (X'X)^{-1}X'y $

  • The Big Data problem: $n \gg p$

    • The calculations of $X'X$ and $X'y$ are very computationally demanding.
    • But notice that the final outputs $(X'X)_{p\times p}$ and $(X'y)_{p\times 1}$ are fairly small.

Linear Regression: Knocking on wood¶

  • Let's start with a simple case. Write $x_{i.}$ for the $p\times 1$ column vector holding the $i$-th row of $X$, so that
\begin{equation} X'y = \begin{bmatrix} x_{1.}, x_{2.}, ..., x_{n.} \end{bmatrix} y = \sum_{i=1}^n x_{i.}y_i \end{equation}
  • Then you have
\begin{equation} X'X = \begin{bmatrix} x_{1.}, x_{2.}, ..., x_{n.}\end{bmatrix} \times \begin{bmatrix} x'_{1.}\\ x'_{2.}\\ ...\\ x'_{n.} \end{bmatrix} = \sum_{i=1}^n x_{i.} x'_{i.} \end{equation}
  • Both quantities are sums over observations, so each Mapper can compute the partial sums for its own chunk and the Reducer simply adds them up.

Example code¶
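A minimal single-machine sketch in Python of the scheme above (the function names and simulated data are illustrative): each mapper() call plays the role of one map task and returns the small per-chunk statistics $X'X$ and $X'y$; the reducer adds them up and solves the normal equations.

    import numpy as np

    def mapper(X_chunk, y_chunk):
        """Map task: the p x p and p x 1 sufficient statistics of one chunk."""
        return X_chunk.T @ X_chunk, X_chunk.T @ y_chunk

    def reducer(partials):
        """Reduce task: sum the per-chunk statistics and solve for beta."""
        XtX = sum(part[0] for part in partials)
        Xty = sum(part[1] for part in partials)
        return np.linalg.solve(XtX, Xty)       # (X'X)^{-1} X'y

    # Simulated data, split into 10 chunks to mimic distributed blocks.
    rng = np.random.default_rng(0)
    n, p = 100_000, 5
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    y = X @ np.arange(1, p + 1) + rng.normal(size=n)

    partials = [mapper(Xc, yc)
                for Xc, yc in zip(np.array_split(X, 10), np.array_split(y, 10))]
    print(reducer(partials))                   # close to [1, 2, 3, 4, 5]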

Logistic Regression¶

  • In statistics, logistic regression or logit regression is a type of probabilistic classification model.

  • Logistic regression is used extensively in numerous disciplines, including the medical and social science fields. It can be binomial or multinomial.

  • Binary logistic regression deals with situations in which the outcome for a dependent variable can have two possible types.

  • Multinomial logistic regression deals with situations where the outcome can have three or more possible types.

  • Logistic regression is built on the logistic (sigmoid) function, which maps any real value into $(0,1)$.

Logistic Regression: The Model¶

  • The logit model connects the explanatory variables to the response probability $P_i$ in this way

    $P_i=\frac{1}{1+\exp(-(\beta_1+\beta_2X_i))}$

  • Alternatively we can write the model in this way

    $\log \frac{P_i}{1-P_i} = \beta_1+\beta_2X_i$

    where $P_i/(1-P_i)$ is the odds: the ratio of the probability that a family owns a house to the probability that it does not. The two forms are equivalent: solving the first equation for the odds and taking logs gives the second.

  • This model can be easily estimated with the glm() function in R or sklearn.linear_model.LogisticRegression() in Python (see the sketch after this list).

  • Logistic regression differs from linear regression in that its estimates have no analytical (closed-form) solution; the likelihood must be maximized iteratively.
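As a single-machine baseline (the sketch referred to above), the model can be fitted in a few lines of Python; the true coefficients $\beta_1=0.5$ and $\beta_2=2$ below are arbitrary choices for the simulation.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)
    prob = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))   # P_i from the logit model
    y = rng.binomial(1, prob)                   # binary outcomes

    # Note: the default L2 penalty shrinks the estimates slightly.
    model = LogisticRegression().fit(x.reshape(-1, 1), y)
    print(model.intercept_, model.coef_)        # estimates of beta_1, beta_2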

Logistic Regression with MapReduce¶

  • Bad news: the estimation above requires a sequential, iterative method, where each iteration depends on the result of the previous one.

  • Will the following hypothetical Hadoop workflow work? (A fuller sketch follows this list.)

      Defining the Mapper function
      Defining the Reducer function
      Defining the Logistic Regression MapReduce function
  • Logistic regression is a standard industry workhorse, underlying many production fraud-detection and advertising quality and targeting products. The most common large-scale implementations use Stochastic Gradient Descent (SGD) to allow large training sets to be used. The good news is that SGD is blazingly fast, so handling training sets of tens of millions of examples is not a problem; with the down-sampling typical in many datasets, this is equivalent to billions of raw training examples. Ready-to-use implementations of this approach are available off the shelf.
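To see why the iterations are the sticking point, here is a single-machine sketch in Python of one possible workflow: every gradient-ascent step on the logistic log-likelihood is one MapReduce pass (mappers compute per-chunk gradients, the reducer sums them), and the driver has to launch a fresh job for each iteration. The learning rate, iteration count and all names are illustrative.

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def mapper(X_chunk, y_chunk, beta):
        """Map task: per-chunk gradient of the log-likelihood."""
        return X_chunk.T @ (y_chunk - sigmoid(X_chunk @ beta))

    def reducer(partial_grads):
        """Reduce task: add up the per-chunk gradients."""
        return sum(partial_grads)

    def fit(chunks, p, lr=1.0, iters=500):
        n = sum(len(yc) for _, yc in chunks)
        beta = np.zeros(p)
        for _ in range(iters):     # each loop = one full MapReduce job!
            grad = reducer([mapper(Xc, yc, beta) for Xc, yc in chunks])
            beta += lr * grad / n  # gradient ascent on the likelihood
        return beta

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(2000), rng.normal(size=2000)])
    y = rng.binomial(1, sigmoid(X @ np.array([0.5, 2.0])))
    chunks = list(zip(np.array_split(X, 4), np.array_split(y, 4)))
    print(fit(chunks, p=2))        # moves toward [0.5, 2.0]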

Logistic Regression: the divide and conquer approach¶

  • Divide the $n$ observations into $k$ blocks, so that each block consists of $m$ observations.

  • Fit a logistic regression to each block on a single node:

    $\widehat \beta_l = \arg\max_\beta \sum _{i =1}^m \left\{ y_{li}x_{li}'\beta-\log\left(1+\exp\{x_{li}'\beta\}\right)\right\}.$

  • The full logistic regression coefficients:

    • $\widehat \beta$ can be approximated by the average of the block estimates $\widehat \beta_l$ (a simple average when all blocks have the same size $m$):

      $ \widehat \beta = \frac{1}{k}\sum_{l=1}^k \widehat \beta_l. $

Example code¶
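A single-machine sketch in Python of the divide-and-conquer estimator (all settings are illustrative): the per-block fits correspond to the map step and the averaging to the reduce step; a very large C makes sklearn's L2 penalty negligible, so each block fit is essentially the maximum likelihood estimate.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n, k = 20_000, 10                          # n observations, k blocks
    X = rng.normal(size=(n, 2))
    prob = 1 / (1 + np.exp(-(X @ np.array([1.0, -1.0]))))
    y = rng.binomial(1, prob)

    betas = []
    for Xl, yl in zip(np.array_split(X, k), np.array_split(y, k)):
        m = LogisticRegression(C=1e6).fit(Xl, yl)   # one block per node
        betas.append(np.concatenate([m.intercept_, m.coef_.ravel()]))

    print(np.mean(betas, axis=0))              # roughly [0, 1, -1]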

So far so good?¶