Introduction to forecasting computation¶
Feng Li¶
Guanghua School of Management¶
Peking University¶
feng.li@gsm.pku.edu.cn¶
Course home page: https://feng.li/bdcf¶
Speed up computations with many CPUs and GPUs¶
GPU parallel computing and distributed computing both focus on speeding up computations, but they do so in different ways and are suited for different types of problems.
GPU parallel computing uses a single machine with one or more GPUs (Graphics Processing Units) to perform many calculations simultaneously. GPUs have thousands of cores that can process tasks in parallel, making them highly efficient for tasks like matrix operations and deep learning.
Distributed computing uses multiple machines (nodes or servers) connected over a network to divide and execute tasks in parallel. It can be organized as clusters, grids, or cloud-based systems, using frameworks such as Apache Spark.
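As a toy illustration of the GPU side, here is a minimal sketch assuming PyTorch is installed (an assumption for illustration; the code falls back to CPU when no CUDA device is present):

```python
import torch

# Pick the GPU if one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# A large matrix product: a SIMD-style workload where the same instruction
# is applied to many data elements across thousands of GPU cores at once.
x = torch.randn(4096, 4096, device=device)
y = x @ x
print(y.shape, device)
```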
Comparison Table¶
| Feature | GPU Parallel Computing | Distributed Computing |
|---|---|---|
| Scope | Single machine, multiple cores | Multiple machines, multiple nodes |
| Hardware | GPUs (thousands of cores) | CPUs/GPUs in separate servers |
| Data size | Limited by GPU memory (VRAM) | Can scale to petabytes of data |
| Computation type | SIMD (Single Instruction, Multiple Data) | MIMD (Multiple Instructions, Multiple Data) |
| Best use cases | Deep learning, HPC, scientific computing | Big data analytics, large-scale web applications |
| Communication overhead | Minimal (within one machine) | High (network-based communication) |
When to Use What?¶
- Use GPU parallel computing if your task involves massive numerical computations that fit within a single machine (e.g., deep learning, AI training).
- Use distributed computing if you need to process extremely large datasets across multiple machines (e.g., big data analytics with Spark).
The move-code-to-data philosophy¶
The traditional supercomputer requires repeated transmission of data between clients and servers. This works fine for computationally intensive work, but for data-intensive processing the data become too large to move around easily.
A distributed system, by contrast, focuses on moving code to data.
The clients send only the programs to be executed, and these programs are usually small.
More importantly, data are broken up and distributed across the cluster, and as much as possible, computation on a piece of data takes place on the same machine where that piece of data resides.
The whole process is known as MapReduce.
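A minimal PySpark sketch of the idea (the HDFS path is hypothetical): the small lambda functions below are the only things shipped over the network, while the distributed text file itself never moves.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("move-code-to-data").getOrCreate()

# The data (a hypothetical large file on HDFS) stays where it is;
# only the functions below are sent to the workers holding each block.
lines = spark.sparkContext.textFile("hdfs:///data/logs.txt")

word_counts = (
    lines.flatMap(lambda line: line.split())  # map: runs on the node storing the block
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)     # reduce: aggregates counts per word
)

print(word_counts.take(5))
```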
What do we have with Spark?¶
What do we (statisticians) miss with distributed platforms?¶
Interpretable statistical models such as GLM and Time Series Forecasting Models.
Efficient Bayesian inference tools such as MCMC, Gibbs and Variational Inference.
Distributed statistical visualization tools like ggplot2, seaborn and plotly ...
Why is it difficult to develop statistical models on distributed systems?¶
-- Especially for statisticians
No unified solution for deploying conventional statistical methods to distributed computing platforms.
Steep learning curve for using distributed computing.
Hard to balance estimator efficiency against communication cost.
Unrealistic model assumptions, e.g., requiring data to be randomly distributed across workers.
Spark APIs for statisticians to develop distributed models¶
UDFs for DataFrames-based API¶
User-Defined Functions (UDFs) are a feature of Spark that allows users to define their own functions when the system's built-in functions are not enough to perform the desired task.
The API is available in Spark (>= 2.3). It runs with PySpark (requiring Apache Arrow) and Scala.
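A minimal sketch of a pandas UDF (assuming Spark >= 3.0 type-hint syntax and pyarrow installed; the column and function names are illustrative):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pd.DataFrame({"temp_c": [0.0, 20.0, 37.0, 100.0]}))

# A scalar pandas UDF: the function receives whole pandas Series at once,
# transferred between the JVM and Python via Apache Arrow, instead of
# being called row by row like a plain Python UDF.
@pandas_udf("double")
def celsius_to_fahrenheit(c: pd.Series) -> pd.Series:
    return c * 9.0 / 5.0 + 32.0

df.select(celsius_to_fahrenheit("temp_c").alias("temp_f")).show()
```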
Real projects on Spark¶
Code available at https://github.com/feng-li/dstats
DLSA: Least squares approximation for a distributed system¶
in Journal of Computational and Graphical Statistics, 2021 (with Xuening Zhu & Hansheng Wang) https://doi.org/10.1080/10618600.2021.1923517
We estimate the parameter $\theta$ on each worker separately by using local data on distributed workers. This can be done efficiently by using standard statistical estimation methods (e.g., maximum likelihood estimation).
Each worker passes the local estimator of $\theta$ and its asymptotic covariance estimate to the master.
A weighted least squares-type objective function can then be constructed on the master. This can be viewed as a local quadratic approximation of the global log-likelihood function.
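A minimal NumPy sketch of the combination step on the master (an illustrative reconstruction, not the paper's code): each worker is assumed to have returned a local estimate and its asymptotic covariance estimate.

```python
import numpy as np

def dlsa_combine(thetas, sigmas):
    """Weighted least squares combination of local estimators.

    thetas : list of length-p arrays, local estimates from each worker
    sigmas : list of (p, p) arrays, local asymptotic covariance estimates
    """
    # Weight each local estimator by its inverse covariance (precision),
    # then solve the resulting weighted least squares problem.
    precision = sum(np.linalg.inv(S) for S in sigmas)
    weighted_sum = sum(np.linalg.inv(S) @ t for t, S in zip(thetas, sigmas))
    return np.linalg.solve(precision, weighted_sum)

# Toy usage with two "workers":
theta_hat = dlsa_combine(
    [np.array([1.0, 2.0]), np.array([1.2, 1.8])],
    [np.eye(2), 2 * np.eye(2)],
)
print(theta_hat)
```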
Efficiency and cost-effectiveness¶
A standard industrial-level Spark-on-YARN cluster on the Alibaba cloud server consists of one master node and two worker nodes. Each node contains 64 virtual cores, 64 GB of RAM and two 80 GB SSD local hard drives (cost: 300 RMB per day).
We find that only $26.2$ minutes are needed for DLSA.
The traditional MLE takes more than $15$ hours and obtains an inferior result (cost: 187 RMB).
That means DLSA saves about 97% of the computational cost (only about 6 RMB).
Distributed quantile regression by pilot sampling and one-step updating¶
in Journal of Business and Economic Statistics, 2021 (with Rui Pan, Tunan Ren, Baishan Guo, Guodong Li & Hansheng Wang) https://doi.org/10.1080/07350015.2021.1961789
We draw a random sample of size $n$ from the distributed system, where $n$ is much smaller than the full sample size $N$.
Thereafter, a standard quantile regression estimator can be obtained on the master, which is referred to as the pilot estimator.
To further enhance the statistical efficiency, we propose a one-step Newton-Raphson type algorithm to upgrade the pilot estimator.
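A toy sketch of the pilot step, using statsmodels as a stand-in for the quantile regression solver on the master (simulated data; the one-step Newton-Raphson update is omitted):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

# Simulated data standing in for the full distributed sample of size N.
N, n = 1_000_000, 5_000
X = rng.normal(size=(N, 3))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=N)

# Pilot step: a uniform random sample of size n << N, fitted with a
# standard quantile regression on the master.
idx = rng.choice(N, size=n, replace=False)
pilot = sm.QuantReg(y[idx], sm.add_constant(X[idx])).fit(q=0.5)
print(pilot.params)  # the pilot estimator, later refined by one-step updating
```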
Distributed ARIMA models for ultra-long time series¶
in International Journal of Forecasting, 2023 (with Xiaoqian Wang, Yanfei Kang and Rob J Hyndman)
We develop a novel distributed forecasting framework to tackle challenges associated with forecasting ultra-long time series.
The proposed model combination approach facilitates distributed time series forecasting by combining the local estimators of time series models delivered from worker nodes and minimizing a global loss function.
In this way, instead of unrealistically assuming the data generating process (DGP) of an ultra-long time series stays invariant, we make assumptions only on the DGP of subseries spanning shorter time periods.
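A toy sketch of the split-and-fit stage (simulated data, statsmodels as the local ARIMA solver, and a naive parameter average purely for illustration; the paper instead combines local estimators via their AR representations by minimizing a global loss):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = rng.normal(size=20_000).cumsum()  # toy stand-in for an ultra-long series

# Split into contiguous subseries; on a cluster each piece would sit on a
# different worker, so only the shorter windows need a stable DGP.
subseries = np.array_split(y, 10)

# Fit a local ARIMA on each subseries.
local_fits = [ARIMA(s, order=(1, 1, 1)).fit() for s in subseries]

# Naive combination for illustration only: average the local parameters.
avg_params = np.mean([f.params for f in local_fits], axis=0)
print(avg_params)
```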
Take home message¶
Distributed modeling, computing and visualization are the future of forecasting.
Spark is not the only software for distributed statistical computing, but it is the easiest one.