Developing Distributed Models with Spark¶

Feng Li

School of Statistics and Mathematics

Central University of Finance and Economics

feng.li@cufe.edu.cn

https://feng.li/

Outline

  • The move-code-to-data philosophy
  • What do we have with Spark?
  • What do we (statisticians) miss with distributed platforms?
  • Why is it difficult to develop statistical models on distributed systems?
  • Spark APIs for statisticians to develop distributed models
    • UDFs for DataFrames-based API
    • RDD API with linear algebra support
      • Linear algebra and optimization
      • Random variable generator and distribution
  • Real projects on Spark
    • DLSA: Least squares approximation for a distributed system
    • Distributed quantile regression by pilot sampling and one-step updating
    • Distributed ARIMA models for ultra-long time series
  • Take home message

The move-code-to-data philosophy¶

  • The traditional supercomputer requires repeated transmission of data between clients and servers. This works fine for computationally intensive tasks, but for data-intensive processing the data become too large to be moved around easily.

  • A distributed system instead focuses on moving code to data.

  • The clients send only the programs to be executed, and these programs are usually small.

  • More importantly, data are broken up and distributed across the cluster, and as much as possible, computation on a piece of data takes place on the same machine where that piece of data resides.

  • The whole process is known as MapReduce.

(Figures: the MapReduce workflow; the Spark GitHub repository)
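The philosophy above can be illustrated with a toy word count in plain Python (no Spark): each "partition" is processed where it lives, and only small per-partition summaries travel to the driver. All names here are illustrative.

```python
from functools import reduce

# Pretend each list is a data block stored on a different worker.
partitions = [
    ["spark", "moves", "code", "to", "data"],
    ["data", "stays", "code", "travels"],
]

def map_phase(block):
    # Runs locally on the machine that holds the block.
    return {w: block.count(w) for w in set(block)}

def reduce_phase(a, b):
    # Only these small dictionaries are shipped across the network.
    return {k: a.get(k, 0) + b.get(k, 0) for k in set(a) | set(b)}

counts = reduce(reduce_phase, map(map_phase, partitions))
print(counts["data"])  # "data" appears twice overall -> 2
```

The raw word lists never leave their partitions; only the dictionaries of counts do, which is why the pattern scales to data too large to move.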

What do we have with Spark?¶

(Figure: machine learning algorithms available in Spark MLlib)

What do we (statisticians) miss with distributed platforms?¶

  • Interpretable statistical models such as GLM and Time Series Forecasting Models.

  • Efficient Bayesian inference tools such as MCMC, Gibbs and Variational Inference.

  • Distributed statistical visualization tools like ggplot2, seaborn and plotly.

  • ...

Why is it difficult to develop statistical models on distributed systems?¶

-- Especially for statisticians

  • No unified solution for deploying conventional statistical methods on distributed computing platforms.

  • Steep learning curve for distributed computing.

  • Hard to balance estimator efficiency against communication cost.

  • Unrealistic model assumptions, e.g., requiring data to be randomly distributed across workers.

Spark APIs for statisticians to develop distributed models¶

UDFs for DataFrames-based API¶

  • User-Defined Functions (UDFs) are a feature of Spark that allows users to define their own functions when the system's built-in functions are not enough to perform the desired task.

  • The API is available in Spark (>= 2.3).

  • It runs with PySpark (requiring Apache Arrow) and Scala.
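A minimal sketch of what a pandas UDF body looks like: the function receives a pandas Series (one Arrow batch of a DataFrame column) and returns a Series of the same length. The function name `standardize_batch` is illustrative; the Spark registration lines are shown as comments so the sketch runs with pandas alone.

```python
import pandas as pd

# On a real cluster you would register and apply this with:
#   from pyspark.sql.functions import pandas_udf
#   standardize = pandas_udf(standardize_batch, "double")
#   df.select(standardize("x"))

def standardize_batch(x: pd.Series) -> pd.Series:
    # Center and scale within the batch; purely illustrative.
    return (x - x.mean()) / x.std()

batch = pd.Series([1.0, 2.0, 3.0])
print(standardize_batch(batch).tolist())  # [-1.0, 0.0, 1.0]
```

Because the data arrive as whole pandas batches rather than one row at a time, vectorized numpy/pandas code runs at near-native speed inside the UDF.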

RDD API with linear algebra support¶

  • MLlib uses linear algebra packages Breeze, dev.ludovic.netlib, and netlib-java for optimized numerical processing.

  • Only available in Scala.

  • Steep learning curve.

Linear algebra and optimization¶

  • ml.linalg: Matrix(), DenseMatrix(), SparseMatrix()

  • mllib.linalg: SingularValueDecomposition(), QRDecomposition()

  • mllib.linalg.distributed: BlockMatrix(), CoordinateMatrix(), IndexedRow(), IndexedRowMatrix(), RowMatrix()

  • mllib.optimization: LBFGS(), GradientDescent()
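The idea behind `mllib.linalg.distributed.BlockMatrix` can be sketched in plain numpy: a large matrix is stored as a grid of small blocks, and a matrix product is assembled from products of blocks, each of which could be computed on whichever worker holds the operands. Block sizes and helper names here are illustrative, not Spark API.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

def split(M):
    # Cut a 4x4 matrix into a 2x2 grid of 2x2 blocks.
    return [[M[i:i + 2, j:j + 2] for j in (0, 2)] for i in (0, 2)]

Ab, Bb = split(A), split(B)

# C[i][j] = sum_k A[i][k] @ B[k][j] -- each term involves only two
# small blocks, so no worker ever needs the full matrices.
C = np.block([[sum(Ab[i][k] @ Bb[k][j] for k in range(2))
               for j in range(2)] for i in range(2)])

assert np.allclose(C, A @ B)
```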

Random variable generator and distribution¶

  • mllib.random: GammaGenerator(), LogNormalGenerator(), PoissonGenerator(), StandardNormalGenerator(), UniformGenerator(), WeibullGenerator(), ExponentialGenerator()

  • mllib.stat.distribution: MultivariateGaussian()
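What these generators do, conceptually, is draw random numbers partition by partition with non-overlapping streams. A numpy sketch of that pattern (the `SeedSequence.spawn` scheme is a numpy idiom, not Spark's internal mechanism):

```python
import numpy as np

def partition_normals(seed, n_partitions, n_per_partition):
    # Spawn one independent child stream per partition so that the
    # draws on different workers do not overlap.
    children = np.random.SeedSequence(seed).spawn(n_partitions)
    return [np.random.default_rng(s).standard_normal(n_per_partition)
            for s in children]

parts = partition_normals(seed=42, n_partitions=4, n_per_partition=1000)
sample = np.concatenate(parts)
print(len(sample))  # 4000
```

Each partition can be generated in parallel and never leaves its worker; only the seeds are coordinated centrally.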

Real projects on Spark¶

Code available at https://github.com/feng-li/dstats

DLSA: Least squares approximation for a distributed system¶

in Journal of Computational and Graphical Statistics, 2021 (with Xuening Zhu & Hansheng Wang) https://doi.org/10.1080/10618600.2021.1923517

  • We estimate the parameter $\theta$ on each worker separately by using local data on distributed workers. This can be done efficiently by using standard statistical estimation methods (e.g., maximum likelihood estimation).

  • Each worker passes the local estimator of $\theta$ and its asymptotic covariance estimate to the master.

  • A weighted least squares-type objective function can then be constructed on the master. This can be viewed as a local quadratic approximation of the global log-likelihood function.
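The combination step described above can be sketched in numpy: each worker ships its local estimator and (inverse-)covariance statistic, and the master solves a weighted least squares problem. For OLS with a common noise variance this combination reproduces the global fit exactly, which is what the small check below verifies; the data and worker count are simulated assumptions, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = np.array([2.0, -1.0])
X = rng.standard_normal((1000, 2))
y = X @ theta_true + rng.standard_normal(1000)

stats = []
for Xk, yk in zip(np.split(X, 4), np.split(y, 4)):      # 4 "workers"
    theta_k = np.linalg.lstsq(Xk, yk, rcond=None)[0]    # local estimator
    Sigma_k_inv = Xk.T @ Xk                             # inverse covariance, up to sigma^2
    stats.append((Sigma_k_inv @ theta_k, Sigma_k_inv))  # shipped to the master

# Master: theta = (sum_k Sigma_k^{-1})^{-1} sum_k Sigma_k^{-1} theta_k
num = sum(s for s, _ in stats)
den = sum(S for _, S in stats)
theta_dlsa = np.linalg.solve(den, num)

theta_global = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(theta_dlsa, theta_global)  # matches the all-data fit
```

Only a p-vector and a p-by-p matrix travel per worker, so the communication cost is independent of the local sample sizes.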

Efficiency and cost effectiveness

  • A standard industrial-level Spark-on-YARN cluster on the Alibaba cloud consists of one master node and two worker nodes. Each node contains 64 virtual cores, 64 GB of RAM and two 80 GB SSD local hard drives (cost: 300 RMB per day).

  • We find that $26.2$ minutes are needed for DLSA.

  • The traditional MLE takes more than $15$ hours and obtains an inferior result (cost: 187 RMB).

  • That means DLSA saves about 97% of the computational cost (only 6 RMB).

Distributed quantile regression by pilot sampling and one-step updating¶

in Journal of Business and Economic Statistics, 2021 (with Rui Pan, Tunan Ren, Baishan Guo, Guodong Li & Hansheng Wang) https://doi.org/10.1080/07350015.2021.1961789

  • We conduct a random sampling of size $n$ from the distributed system, where $n$ is much smaller than the whole sample size $N$.

  • Thereafter, a standard quantile regression estimator can be obtained on the master, which is referred to as the pilot estimator.

  • To further enhance the statistical efficiency, we propose a one-step Newton-Raphson type algorithm to upgrade the pilot estimator.
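A stylized numpy sketch of the pilot-plus-one-step idea, reduced to the simplest possible case: estimating the tau-th quantile of a single variable. The pilot estimate comes from a small random sample; the one-step update then touches the full (conceptually distributed) data only through a mean of indicators and a kernel density estimate at the pilot value. The bandwidth and sample sizes are assumptions for illustration; this is not the paper's regression algorithm.

```python
import numpy as np

rng = np.random.default_rng(7)
tau = 0.5
y = rng.exponential(size=1_000_000)            # full "distributed" data

# Pilot stage: a small random sample is pulled to the master.
pilot = np.quantile(rng.choice(y, 2000), tau)

# One-step update: each worker would return only two scalars,
# a mean of indicators and a local kernel density contribution.
h = 0.1                                        # bandwidth (assumed)
f_hat = np.mean(np.abs(y - pilot) < h) / (2 * h)
theta = pilot + (tau - np.mean(y <= pilot)) / f_hat

print(theta)  # close to the true median log(2) ~ 0.693
```

The pilot estimator already has the right rate on the subsample; the Newton-type step upgrades it to (near) full-sample efficiency at the cost of one extra round of cheap scalar communication.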

Distributed ARIMA models for ultra-long time series¶

in arXiv:2007.09577 (with Xiaoqian Wang, Yanfei Kang and Rob J Hyndman)

  • We develop a novel distributed forecasting framework to tackle challenges associated with forecasting ultra-long time series.

  • The proposed model combination approach facilitates distributed time series forecasting by combining the local estimators of time series models delivered from worker nodes and minimizing a global loss function.

  • In this way, instead of unrealistically assuming the data generating process (DGP) of an ultra-long time series stays invariant, we make assumptions only on the DGP of subseries spanning shorter time periods.
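A toy numpy sketch of the split-and-combine pattern: a long series is cut into subseries, a model coefficient is estimated on each, and the local estimates are combined on the master (here by a simple average, since the subseries have equal length). The real DARIMA framework fits full ARIMA models per subseries and combines them by minimizing a global loss; this only illustrates the communication pattern.

```python
import numpy as np

rng = np.random.default_rng(3)
phi, n = 0.6, 100_000
e = rng.standard_normal(n)
y = np.zeros(n)
for t in range(1, n):                      # simulate an AR(1) series
    y[t] = phi * y[t - 1] + e[t]

def ar1_fit(x):
    # Least squares estimate of the AR(1) coefficient on one subseries.
    return float(x[:-1] @ x[1:] / (x[:-1] @ x[:-1]))

subseries = np.split(y, 10)                # 10 "workers"
local = [ar1_fit(s) for s in subseries]    # estimated locally
phi_hat = float(np.mean(local))            # combined on the master

print(round(phi_hat, 1))  # close to the true value 0.6
```

Note that each local fit only assumes the subseries it sees is well behaved; no stationarity of the data generating process over the full span is required for the scheme to be carried out.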

(Figure: the DARIMA framework)

Take home message¶

  • Distributed modeling, computing and visualization are the future of statistics.

  • Spark is not the only software for distributed statistical computing,

  • But it is arguably the easiest one.