{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Introduction to Spark\n", "\n", "\n", "\n", "## Feng Li\n", "\n", "### Central University of Finance and Economics\n", "\n", "### [feng.li@cufe.edu.cn](feng.li@cufe.edu.cn)\n", "### Course home page: [https://feng.li/distcomp](https://feng.li/distcomp)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Why Spark\n", "\n", "### Speed\n", "\n", "- Run workloads 100x faster. \n", "- Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.\n", "\n", "![Spark-speed](./figures/spark-logistic-regression.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Who created Spark\n", "\n", "\n", "- Spark was a PhD student project in Berkerley University.\n", "\n", "- [Matei Zaharia](https://cs.stanford.edu/people/matei/) was the major contributor during his PhD at UC Berkeley in 2009.\n", "\n", "- [Matei’s research work](http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf) was recognized through the 2014 ACM Doctoral Dissertation Award for the best PhD dissertation in computer science, an NSF CAREER Award, and the US Presidential Early Career Award for Scientists and Engineers (PECASE)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Ease of Use\n", "\n", "- Write applications quickly in Java, Scala, Python, R, and SQL. \n", "\n", "- Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python, R, and SQL shells.\n", "\n", "\n", "- DataFrame with pandas API support" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Generality\n", "\n", "- Combine SQL, streaming, and complex analytics.\n", "\n", "- Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Runs Everywhere\n", "\n", "- Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources.\n", "\n", "- You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.\n", "\n", "![Spark run everywhere](./figures/spark-runs-everywhere.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Spark architecture\n", "\n", "![Spark](./figures/spark-cluster-overview.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Spark Built-in Libraries:\n", "\n", "- SQL and DataFrames\n", "- Spark Streaming\n", "- MLlib, ML (machine learning)\n", "- GraphX (graph)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Launching Applications with `spark-submit`\n", "\n", "- Once a user application is bundled, it can be launched using the bin/spark-submit script. This script takes care of setting up the classpath with Spark and its dependencies, and can support different cluster managers and deploy modes that Spark supports:\n", "\n", "\n", " spark-submit \\\n", " --class \\\n", " --master \\\n", " --deploy-mode \\\n", " --conf = \\\n", " ... # other options\n", " \\\n", " [application-arguments]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Run on a YARN cluster\n", "\n", " spark-submit \\\n", " --class org.apache.spark.examples.SparkPi \\\n", " --master yarn \\\n", " --deploy-mode cluster \\ # can be client for client mode\n", " --executor-memory 20G \\\n", " --num-executors 50 \\\n", " /path/to/examples.jar \\\n", " 1000" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Run a Python application on a Spark-on-YARN cluster\n", " PYSPARK_PYTHON=python3.7 spark-submit \\\n", " --master yarn \\\n", " examples/src/main/python/pi.py \\\n", " 1000" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Spark with R\n", "\n", "- Spark also provides an experimental R API since 1.4 (only DataFrames APIs included). \n", "\n", "- To run Spark interactively in a R interpreter, use `sparkR`:\n", "\n", " sparkR --master local[2]\n", "\n", "- Example applications are also provided in R. For example,\n", "\n", " spark-submit examples/src/main/r/dataframe.R" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Run Spark via Pyspark\n", "\n", "- It is also possible to launch the PySpark shell. Set `PYSPARK_PYTHON` variable to select the approperate Python when running `pyspark` command:\n", "\n", " PYSPARK_PYTHON=python3.7 pyspark" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Run spark interactively within Python\n", "\n", "- You could use spark as a Python's module, but `PySpark` isn't on `sys.path` by default.\n", "\n", "- That doesn't mean it can't be used as a regular library. \n", "\n", "- You can address this by either symlinking pyspark into your site-packages, or adding pyspark to sys.path at runtime. findspark does the latter.\n", "\n", "- To initialize PySpark, just call it within Python" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import findspark\n", "findspark.init()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Then you could import the `pyspark` module" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import pyspark" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.17" }, "rise": { "auto_select": "first", "autolaunch": false, "chalkboard": { "chalkEffect": 1, "chalkWidth": 4, "theme": "whiteboard", "transition": 800 }, "enable_chalkboard": true, "reveal_shortcuts": { "chalkboard": { "clear": "ctrl-k" } }, "start_slideshow_at": "selected", "theme": "white" } }, "nbformat": 4, "nbformat_minor": 4 }