Developing Distributed Models with Spark

Feng Li

School of Statistics and Mathematics

Central University of Finance and Economics

feng.li@cufe.edu.cn

https://feng.li/

Outline

The move-code-to-data philosophy

MapReduce
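As a rough illustration of the map, shuffle, and reduce steps, here is a minimal word-count sketch in plain Python. It does not use Hadoop or Spark. The in-memory `partitions` list stands in for data blocks on different nodes, and the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical names chosen for this example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(partition):
    """Emit a (word, 1) pair for every word in one partition of lines."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


# Assumption: each inner list plays the role of a data block stored on one node.
partitions = [
    ["spark moves code to data"],
    ["data stays where the data lives"],
]

# The map function is applied to each partition independently.
mapped = [pair for part in partitions for pair in map_phase(part)]
word_counts = reduce_phase(shuffle(mapped))
```

In a real cluster the map function is shipped to the nodes that hold each block, so only the small (key, value) pairs travel over the network.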

Spark-GitHub

What do we have with Spark?

Spark-ML

What do we (statisticians) miss with distributed platforms?

Why is it difficult to develop statistical models on distributed systems?

-- Especially for statisticians

Spark APIs for statisticians to develop distributed models

UDFs for the DataFrame-based API

RDD API with linear algebra support

Linear algebra and optimization

Random variable generator and distribution

Real projects on Spark

Code available at https://github.com/feng-li/dstats

DLSA: Least squares approximation for a distributed system

in Journal of Computational and Graphical Statistics, 2021 (with Xuening Zhu & Hansheng Wang) https://doi.org/10.1080/10618600.2021.1923517
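The core idea of combining worker-level estimates can be sketched with a toy example. This is not the authors' implementation (see the dstats repository for that) and it does not use Spark. It assumes a scalar regression through the origin, y = beta * x + noise, fits each chunk separately, and then averages the local slopes weighted by their local information sum(x^2). The names `local_fit` and `combine` are hypothetical. In this special case the weighted combination reproduces the full-data least squares slope.

```python
import random


def local_fit(xs, ys):
    """Per-worker step: OLS slope for y = beta * x and its information sum(x^2)."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / sxx, sxx


def combine(local_results):
    """Master step: average the local slopes, weighted by local information."""
    total_weight = sum(weight for _, weight in local_results)
    return sum(beta * weight for beta, weight in local_results) / total_weight


# Simulated data (assumption: true slope 2.0, standard normal noise).
random.seed(0)
true_beta = 2.0
xs = [random.gauss(0, 1) for _ in range(3000)]
ys = [true_beta * x + random.gauss(0, 1) for x in xs]

# Split the data into chunks, one per hypothetical worker.
n_workers = 3
chunks = [(xs[k::n_workers], ys[k::n_workers]) for k in range(n_workers)]

# Each worker returns only two numbers, so communication cost is tiny.
local_results = [local_fit(cx, cy) for cx, cy in chunks]
beta_combined = combine(local_results)

# Reference: the slope computed on all data at once.
beta_global, _ = local_fit(xs, ys)
```

Here `beta_combined` and `beta_global` agree up to floating-point rounding, which shows why a single round of communication can be enough when the local summaries are weighted appropriately.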

Efficiency and cost effectiveness

Distributed quantile regression by pilot sampling and one-step updating

in Journal of Business and Economic Statistics, 2021 (with Rui Pan, Tunan Ren, Baishan Guo, Guodong Li & Hansheng Wang) https://doi.org/10.1080/07350015.2021.1961789

Distributed ARIMA models for ultra-long time series

in arXiv:2007.09577 (with Xiaoqian Wang, Yanfei Kang and Rob J Hyndman)

DARIMA

Take home message