{ "cells": [ { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Introduction to Spark\n", "\n", "\n", "\n", "## Feng Li\n", "\n", "### Guanghua School of Management\n", "### Peking University\n", "\n", "\n", "### [feng.li@gsm.pku.edu.cn](feng.li@gsm.pku.edu.cn)\n", "### Course home page: [https://feng.li/bdcf](https://feng.li/bdcf)" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# What is Spark\n", "\n", "Spark is the **big data distributed computing** version of TensorFlow/PyTorch for deep learning." ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Why Spark\n", "\n", "### Speed\n", "\n", "- Run workloads 100x faster. \n", "- Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "### Programming & Usability\n", "\n", "| Feature | **Apache Spark** | **TensorFlow/PyTorch** |\n", "|---------|----------------|----------------------|\n", "| **Languages Supported** | Scala, Java, Python, R | Python (main), C++, Java (limited) |\n", "| **Ease of Use** | High-level APIs (DataFrame, SQL) | PyTorch: Easier for research; TensorFlow: More optimized for production |\n", "| **Learning Curve** | Moderate (requires understanding of distributed computing) | PyTorch: Easier; TensorFlow: Steeper curve (especially older versions) |" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "### Scalability & Performance \n", "| Feature | **Apache Spark** | **TensorFlow/PyTorch** |\n", "|---------|----------------|----------------------|\n", "| **Scalability** | Scales horizontally (across clusters) | Scales vertically (leveraging multiple GPUs/TPUs) |\n", "| **Distributed Computing** | Yes (native support for cluster computing) | Requires additional tools like Horovod or PyTorch Distributed |\n", "| **Speed** | Optimized for large-scale distributed data processing | Optimized for numerical operations and matrix multiplications |" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "### Machine Learning Support\n", "\n", "| Feature | **Apache Spark** | **TensorFlow/PyTorch** |\n", "|---------|----------------|----------------------|\n", "| **Built-in ML** | Yes (MLlib, MLFlow integration) | No (but specialized for deep learning) |\n", "| **Deep Learning Support** | Limited (via TensorFlowOnSpark, Deep Learning Pipelines) | Specialized for neural networks, CNNs, RNNs, transformers, etc. 
|\n" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Who created Spark\n", "\n", "\n", "- Spark was a PhD student project in Berkerley University.\n", "\n", "- [Matei Zaharia](https://cs.stanford.edu/people/matei/) was the major contributor during his PhD at UC Berkeley in 2009.\n", "\n", "- [Matei’s research work](http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf) was recognized through the 2014 ACM Doctoral Dissertation Award for the best PhD dissertation in computer science, an NSF CAREER Award, and the US Presidential Early Career Award for Scientists and Engineers (PECASE)." ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "### Ease of Use\n", "\n", "- Write applications quickly in Java, Scala, Python, R, and SQL. \n", "\n", "- Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python, R, and SQL shells.\n", "\n", "\n", "- DataFrame with pandas API support" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "### Generality\n", "\n", "- Combine SQL, streaming, and complex analytics.\n", "\n", "- Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application." ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "### Runs Everywhere\n", "\n", "- Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources.\n", "\n", "- You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources." ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Spark architecture\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Spark Built-in Libraries:\n", "\n", "- SQL and DataFrames\n", "- Spark Streaming\n", "- MLlib, ML (machine learning)\n", "- GraphX (graph)" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Run a Python application on a Spark cluster\n", "\n", "```bash\n", "PYSPARK_PYTHON=python3.12 spark-submit \\\n", " --master https:1.2.3.4 \\\n", " examples/src/main/python/pi.py \\\n", " 1000\n", "```" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Interactively Run Spark via Pyspark\n", "\n", "- It is also possible to launch the PySpark shell. 
{ "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Interactively Run Spark via PySpark\n", "\n", "- It is also possible to launch the PySpark shell. Set the `PYSPARK_PYTHON` environment variable to select the appropriate Python when running the `pyspark` command:\n", "\n", "    ```bash\n", "    PYSPARK_PYTHON=python3.12 pyspark\n", "    ```\n" ] },
{ "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Run Spark interactively within Jupyter Notebook\n", "\n", "- You can use Spark as a Python module.\n", "\n", "- On PKU HPC, you can access it by creating your own notebook at\n", "\n", "    https://scow-jx2.pku.edu.cn/apps/jx2/createApps\n", "\n", "\n", "- Then you can import the `pyspark` module.\n" ] },
{ "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Install Python modules on PKU HPC\n", "\n", "- The compute nodes (where the Jupyter Notebook runs) do not have internet access.\n", "\n", "- You have to install everything from the login node, using the version of pip that matches the notebook's Python:\n", "\n", "  - In **Jupyter**, use `os.path.dirname(sys.executable)` to locate the directory containing pip. It returns something like\n", "\n", "        /nfs-share/software/anaconda/2020.02/envs/python3.12/bin\n", "\n", "  - On the **login node**, run something like\n", "\n", "        /nfs-share/software/anaconda/2020.02/envs/python3.12/bin/pip install statsmodels\n" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", "25/02/23 21:40:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "3.5.4\n" ] } ], "source": [ "from pyspark.sql import SparkSession\n", "\n", "# Initialize a Spark session\n", "spark = SparkSession.builder \\\n", "    .config(\"spark.ui.enabled\", \"false\") \\\n", "    .appName(\"MyPySparkApp\") \\\n", "    .getOrCreate()\n", "\n", "# Check if Spark is running\n", "print(spark.version)" ] },
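{ "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## A quick sanity check\n", "\n", "- With the session created, a quick way to confirm Spark is usable is to build a tiny DataFrame and query it (a minimal sketch; the data is illustrative):\n", "\n", "```python\n", "# Create a two-row DataFrame (illustrative data) and run a simple filter on it\n", "df = spark.createDataFrame([(\"Alice\", 34), (\"Bob\", 45)], [\"name\", \"age\"])\n", "df.filter(df.age > 40).show()\n", "```" ] },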
{ "cell_type": "code", "execution_count": 2, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "SparkSession - in-memory\n", "SparkContext\n", "v3.5.4\n", "local[*]\n", "MyPySparkApp