Distributed Statistical Computing (大数据分布式计算)

Background

The Distributed Statistical Computing course was developed and taught by Dr. Feng Li in 2014 for a joint master’s program in statistics with several prestigious universities: Peking University, Renmin University of China, Central University of Finance and Economics, University of Chinese Academy of Sciences, and Capital University of Economics and Business.

Dr. Feng Li has also offered this course in the Business Analytics program at Peking University since 2020.

Prerequisites

  • Basic knowledge of statistics
  • Basic knowledge of computing

Literature

Teaching Videos

Slides and lecture notes

Read with online Jupyter Notebook viewer

Part I: Distributed Systems and Distributed Computing

Jupyter Notebooks | Slides | Videos
L00: Linux Basics | HTML | 1
L01.1: Introduction to Distributed Computing | HTML | 1, 2, 3, 4
L01.2: Introduction to Hadoop | HTML | 1
L02.1: Understanding MapReduce | HTML |
L02.2: MapReduce with Hadoop Streaming | HTML |
L03.1: Statistics with Hadoop | HTML |
L03.2: Statistical Modeling with MapReduce | HTML |
L04: Distributed Databases | HTML |
L05.1: Introduction to Spark | HTML |
L05.2: Datasets and Parallelization in Spark | HTML |
L06.1: Structured Data Processing with Spark | HTML |
L06.2: Working with Spark DataFrame | HTML |
L07.1: Machine Learning with Spark | HTML |
L07.2: Text Processing with Spark | HTML |
L08.1: Introduction to Spark Streaming | HTML |
L08.2: Spark Streaming in Details | HTML |
L09.1: Introduction to Scala | HTML | 1
L09.2: Scala for Spark | HTML | 2

Part II: Advanced Distributed Statistical Computing

Topic | Slides
L09:
L10:
L11:
L12:
L13:
L14:
L15:
L16: