Feng Li
School of Statistics and Mathematics
Central University of Finance and Economics
A distributed system focuses on moving code to data.
The clients send only the programs to be executed, and these programs are usually small.
More importantly, data are broken up and distributed across the cluster, and as much as possible, computation on a piece of data takes place on the same machine where that piece of data resides.
The whole process is known as MapReduce.
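A minimal PySpark sketch of this idea (the session setup and toy data are illustrative assumptions): the lambda functions are serialized and shipped to the workers holding each data partition, and only small partial results travel back.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mapreduce-demo").getOrCreate()
nums = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)

# map runs on the workers holding each partition; reduce combines the
# small partial results on the driver
total = nums.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)  # sum of squares, computed where the data lives
```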
Interpretable statistical models, such as GLMs and time series forecasting models.
Efficient Bayesian inference tools, such as MCMC, Gibbs sampling, and variational inference.
Distributed statistical visualization tools, like ggplot2, seaborn, and plotly...
-- Especially for statisticians
No unified solution for deploying conventional statistical methods on distributed computing platforms.
Steep learning curve for distributed computing.
Hard to balance estimator efficiency against communication cost.
Unrealistic model assumptions, e.g., requiring data to be randomly distributed across workers.
User-Defined Functions (UDFs) are a feature of Spark that allows users to define their own functions when the system's built-in functions are not enough to perform the desired task.
The API is available in Spark (>= 2.3).
It runs with PySpark (requiring Apache Arrow) and Scala.
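A minimal pandas UDF sketch in PySpark, assuming a toy DataFrame with a single column `x` (the column name, session setup, and `log1p_udf` helper are illustrative):

```python
# Minimal pandas UDF sketch (Spark >= 2.3; requires pyarrow installed)
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("udf-demo").getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["x"])

@pandas_udf("double")
def log1p_udf(x):
    # Executed on the workers; Arrow ships each batch as a pandas Series
    return np.log1p(x)

df.select(log1p_udf("x").alias("log1p_x")).show()
```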
MLlib uses the linear algebra packages Breeze, dev.ludovic.netlib, and netlib-java for optimized numerical processing.
Only available in Scala.
Steep learning curve.
ml.linalg: Matrix(), DenseMatrix(), SparseMatrix()
mllib.linalg: SingularValueDecomposition(), QRDecomposition()
mllib.linalg.distributed: BlockMatrix(), CoordinateMatrix(), IndexedRow(), IndexedRowMatrix(), RowMatrix()
mllib.optimization: LBFGS(), GradientDescent()
mllib.random: GammaGenerator(), LogNormalGenerator(), PoissonGenerator(), StandardNormalGenerator(), UniformGenerator(), WeibullGenerator(), ExponentialGenerator()
mllib.stat.distribution: MultivariateGaussian()
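For instance, a distributed truncated SVD with RowMatrix in PySpark (the toy matrix below is an illustrative assumption):

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.appName("linalg-demo").getOrCreate()
rows = spark.sparkContext.parallelize(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]])
mat = RowMatrix(rows)                    # rows live on the workers
svd = mat.computeSVD(2, computeU=True)   # truncated SVD with k = 2
print(svd.s)                             # singular values, on the driver
```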
Code available at https://github.com/feng-li/dstats
in Journal of Computational and Graphical Statistics, 2021 (with Xuening Zhu & Hansheng Wang) https://doi.org/10.1080/10618600.2021.1923517
We estimate the parameter $\theta$ on each worker separately, using only the local data. This can be done efficiently with standard statistical estimation methods (e.g., maximum likelihood estimation).
Each worker passes the local estimator of $\theta$ and its asymptotic covariance estimate to the master.
A weighted least squares-type objective function can then be constructed on the master; it can be viewed as a local quadratic approximation of the global log-likelihood function (see the sketch below).
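A hedged numpy sketch of the combination step, assuming worker $k$ returns a local estimate $\hat\theta_k$ and covariance estimate $\hat\Sigma_k$; the paper's exact weighting may differ:

```python
# Sketch of a DLSA-style combination on the master (illustrative only;
# the paper's exact weights may differ)
import numpy as np

def combine_wls(thetas, sigmas):
    """Weighted least squares combination: minimizes
    sum_k (theta - theta_k)' inv(Sigma_k) (theta - theta_k)."""
    precisions = [np.linalg.inv(S) for S in sigmas]
    A = sum(precisions)                                # total precision
    b = sum(P @ t for P, t in zip(precisions, thetas))
    return np.linalg.solve(A, b)                       # global estimator
```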
Efficiency and cost-effectiveness
A standard industrial Spark-on-YARN cluster on Alibaba Cloud consists of one master node and two worker nodes. Each node has 64 virtual cores, 64 GB of RAM, and two 80 GB SSD local drives (cost: 300 RMB per day).
We find that DLSA needs only $26.2$ minutes (costing about 6 RMB).
The traditional MLE takes more than $15$ hours (costing about 187 RMB) and obtains an inferior result.
That means DLSA saves about 97% of the computational cost.
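The costs follow directly from the 300 RMB per-day rate:
$$
26.2~\text{min} \times \frac{300~\text{RMB}}{1440~\text{min}} \approx 5.5~\text{RMB},
\qquad
15~\text{h} \times \frac{300~\text{RMB}}{24~\text{h}} \approx 187.5~\text{RMB}.
$$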
in Journal of Business and Economic Statistics, 2021 (with Rui Pan, Tunan Ren, Baishan Guo, Guodong Li & Hansheng Wang) https://doi.org/10.1080/07350015.2021.1961789
We draw a random sample of size $n$ from the distributed system, where $n$ is much smaller than the full sample size $N$.
A standard quantile regression estimator can then be obtained on the master, referred to as the pilot estimator.
To further enhance statistical efficiency, we propose a one-step Newton-Raphson-type algorithm to upgrade the pilot estimator (sketched below).
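A hedged numpy sketch of such a one-step refinement at quantile level $\tau$; the kernel, bandwidth, and the exact distributed computation in the paper may differ, and X, y are held in memory here only for clarity:

```python
import numpy as np

def one_step_qr(X, y, beta_pilot, tau, h=0.05):
    n = len(y)
    r = y - X @ beta_pilot                     # residuals at the pilot fit
    psi = tau - (r <= 0.0)                     # quantile score function
    # Gaussian-kernel estimate of the residual density at zero
    # (kernel and bandwidth h are illustrative assumptions)
    f0 = np.exp(-0.5 * (r / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
    H = (X * f0[:, None]).T @ X / n            # Hessian-type matrix
    g = X.T @ psi / n                          # gradient-type term
    return beta_pilot + np.linalg.solve(H, g)  # one Newton step
```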
in arXiv:2007.09577 (with Xiaoqian Wang, Yanfei Kang and Rob J Hyndman)
We develop a novel distributed forecasting framework to tackle challenges associated with forecasting ultra-long time series.
The proposed model combination approach facilitates distributed time series forecasting by combining the local estimators of time series models delivered from worker nodes and minimizing a global loss function.
In this way, instead of unrealistically assuming the data generating process (DGP) of an ultra-long time series stays invariant, we make assumptions only on the DGP of subseries spanning shorter time periods.
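A hedged caricature of this framework in PySpark, assuming statsmodels' AutoReg as the local model; the paper's models and its global-loss combination step differ in detail:

```python
import numpy as np
from pyspark.sql import SparkSession
from statsmodels.tsa.ar_model import AutoReg

spark = SparkSession.builder.appName("dist-forecast-demo").getOrCreate()
y = np.cumsum(np.random.randn(100_000))    # synthetic ultra-long series
blocks = np.array_split(y, 16)             # contiguous subseries

def fit_local(block):
    # Each worker fits a small AR model on its own subseries only
    res = AutoReg(block, lags=2).fit()
    return res.params, res.cov_params()    # local estimator + covariance

# Local estimators are computed in parallel and collected on the master,
# where they can be combined, e.g. as in the DLSA-style sketch above.
local_fits = spark.sparkContext.parallelize(blocks, 16).map(fit_local).collect()
```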
Distributed modeling, computing and visualization are the future of statistics.
Spark is not the only software for distributed statistical computing, but it is among the easiest to use.