Distributed Statistical Computing (大数据分布式计算, "Big Data Distributed Computing")

Background

The Distributed Statistical Computing course was developed and taught by Dr. Feng Li in 2014 for a joint master’s program in statistics offered by Peking University, Renmin University of China, Central University of Finance and Economics, University of Chinese Academy of Sciences, and Capital University of Economics and Business.

Dr. Feng Li also offered this course for the Business Analytics program at Peking University from 2019 to 2023. The content of this course is kept as is. An updated course, Big Data Computation and Forecasting, is available in Spring 2025 at the Guanghua School of Management, Peking University.

Prerequisites

  • Basic knowledge of statistics
  • Basic knowledge of computing

Literature

Teaching videos

Slides and lecture notes

Part I: Distributed Systems and Distributed Computing

Jupyter Notebooks | Slides | Videos
L00: Linux Basics | HTML | 1
L01.1: Introduction to Distributed Computing | HTML | 1, 2, 3, 4
L01.2: Introduction to Hadoop | HTML | 1
L02.1: Understanding MapReduce | HTML |
L02.2: MapReduce with Hadoop Streaming | HTML |
L03.1: Statistics with Hadoop | HTML |
L03.2: Statistical Modeling with MapReduce | HTML |
L04: Distributed Databases | HTML |
L05.1: Introduction to Spark | HTML |
L05.2: Datasets and Parallelization in Spark | HTML |
L06.1: Structured Data Processing with Spark | HTML |
L06.2: Working with Spark DataFrame | HTML |
L07.1: Machine Learning with Spark | HTML |
L07.2: Text Processing with Spark | HTML |
L08.1: Introduction to Spark Streaming | HTML |
L08.2: Spark Streaming in Details | HTML |
L09.1: Introduction to Scala | HTML | 1
L09.2: Scala for Spark | HTML | 2

Part II: Advanced Distributed Statistical Computing

Topic | Material
L10.1: Big Data Visualization: Challenges and Viabilities | HTML
L10.2: Statistical Elements of Big Data Visualization | HTML
L10.3: Computational Aspects of Big Data Visualization |
L11: Distributed Statistical Computing: State of the Art |
L11: Least-Square Approximation for a Distributed System | Paper, Code
L12: Distributed ARIMA models for ultra-long time series | Paper, Code
L13: Distributed Quantile Regression by Pilot Sampling and One-Step Updating | Paper, Code
L14: Bayesian Forecasting with Distributed VAR models |