{ "cells": [ { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Working with Spark DataFrame\n", "\n", "\n", "## Feng Li\n", "\n", "### Guanghua School of Management\n", "### Peking University\n", "\n", "\n", "### [feng.li@gsm.pku.edu.cn](feng.li@gsm.pku.edu.cn)\n", "### Course home page: [https://feng.li/bdcf](https://feng.li/bdcf)" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Start a Spark Session" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "import os, sys # Ensure All environment variables are properly set \n", "# os.environ[\"JAVA_HOME\"] = os.path.dirname(sys.executable)\n", "os.environ[\"PYSPARK_PYTHON\"] = sys.executable\n", "os.environ[\"PYSPARK_DRIVER_PYTHON\"] = sys.executable\n", "\n", "from pyspark.sql import SparkSession # build Spark Session\n", "spark = SparkSession.builder\\\n", " .config(\"spark.ui.enabled\", \"false\") \\\n", " .config(\"spark.executor.memory\", \"2g\")\\\n", " .config(\"spark.cores.max\", \"2\")\\\n", " .appName(\"Spark DataFrame\").getOrCreate() # using spark server" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "**Note:**\n", "\n", "- If you have trouble starting a PySpark interactive session due to a system limitation, make sure you have disabled Spark UI `.config(\"spark.ui.enabled\", \"false\")`\n", "\n", "- You could also submit your Spark Job via the `spark-submit` command in PKU HPC server (more details later).\n", "\n", " ```bash\n", " export JAVA_HOME=/nfs-share/software/anaconda/2020.02/envs/python3.12/bin\n", " export PYSPARK_PYTHON=/nfs-share/software/anaconda/2020.02/envs/python3.12/bin/python\n", " export PYSPARK_DRIVER_PYTHON=$PYSPARK_PYTHON\n", " /nfs-share/software/anaconda/2020.02/envs/python3.12/bin/spark-submit \\\n", " --conf spark.ui.enabled=false \\\n", " your-pyspark-code.py \n", " ```" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n", "| date|wm_yr_wk| weekday|wday|month|year| d| event_name_1|event_type_1|event_name_2|event_type_2|snap_CA|snap_TX|snap_WI|\n", "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n", "|2011-01-29| 11101| Saturday| 1| 1|2011| d_1| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "|2011-01-30| 11101| Sunday| 2| 1|2011| d_2| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "|2011-01-31| 11101| Monday| 3| 1|2011| d_3| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "|2011-02-01| 11101| Tuesday| 4| 2|2011| d_4| NULL| NULL| NULL| NULL| 1| 1| 0|\n", "|2011-02-02| 11101|Wednesday| 5| 2|2011| d_5| NULL| NULL| NULL| NULL| 1| 0| 1|\n", "|2011-02-03| 11101| Thursday| 6| 2|2011| d_6| NULL| NULL| NULL| NULL| 1| 1| 1|\n", "|2011-02-04| 11101| Friday| 7| 2|2011| d_7| NULL| NULL| NULL| NULL| 1| 0| 0|\n", "|2011-02-05| 11102| Saturday| 1| 2|2011| d_8| NULL| NULL| NULL| NULL| 1| 1| 1|\n", "|2011-02-06| 11102| Sunday| 2| 2|2011| d_9| SuperBowl| Sporting| NULL| NULL| 1| 1| 1|\n", "|2011-02-07| 11102| Monday| 3| 2|2011|d_10| NULL| NULL| NULL| NULL| 1| 1| 0|\n", "|2011-02-08| 11102| Tuesday| 4| 2|2011|d_11| NULL| NULL| NULL| NULL| 1| 0| 1|\n", "|2011-02-09| 11102|Wednesday| 5| 2|2011|d_12| NULL| NULL| NULL| NULL| 1| 1| 1|\n", "|2011-02-10| 11102| Thursday| 6| 2|2011|d_13| NULL| NULL| NULL| NULL| 1| 0| 0|\n", "|2011-02-11| 11102| Friday| 7| 2|2011|d_14| NULL| NULL| NULL| NULL| 0| 1| 1|\n", "|2011-02-12| 11103| Saturday| 1| 2|2011|d_15| NULL| NULL| NULL| NULL| 0| 1| 1|\n", "|2011-02-13| 11103| Sunday| 2| 2|2011|d_16| NULL| NULL| NULL| NULL| 0| 1| 0|\n", "|2011-02-14| 11103| Monday| 3| 2|2011|d_17|ValentinesDay| Cultural| NULL| NULL| 0| 0| 1|\n", "|2011-02-15| 11103| Tuesday| 4| 2|2011|d_18| NULL| NULL| NULL| NULL| 0| 1| 1|\n", "|2011-02-16| 11103|Wednesday| 5| 2|2011|d_19| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "|2011-02-17| 11103| Thursday| 6| 2|2011|d_20| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "sdf = spark.read.csv(\"../data/m5-forecasting-accuracy/calendar.csv\", header=True, inferSchema=True) # read files\n", "sdf.show() # Displays the content of the DataFrame to stdout" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Descriptive Statistics" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------+----------+------------------+---------+------------------+-----------------+------------------+-----+------------+------------+--------------+------------+-------------------+-------------------+-------------------+\n", "|summary| date| wm_yr_wk| weekday| wday| month| year| d|event_name_1|event_type_1| event_name_2|event_type_2| snap_CA| snap_TX| snap_WI|\n", "+-------+----------+------------------+---------+------------------+-----------------+------------------+-----+------------+------------+--------------+------------+-------------------+-------------------+-------------------+\n", "| count| 1969| 1969| 1969| 1969| 1969| 1969| 1969| 162| 162| 5| 5| 1969| 1969| 1969|\n", "| mean| NULL|11347.086338242763| NULL|3.9974606399187405|6.325545962417471| 2013.288471305231| NULL| NULL| NULL| NULL| NULL|0.33011681056373793|0.33011681056373793|0.33011681056373793|\n", "| stddev| NULL| 155.2770428028507| NULL|2.0011413541040746|3.416864338775945|1.5801982706329631| NULL| NULL| NULL| NULL| NULL|0.47037439309734164|0.47037439309734164|0.47037439309734164|\n", "| min|2011-01-29| 11101| Friday| 1| 1| 2011| d_1|Chanukah End| Cultural| Cinco De Mayo| Cultural| 0| 0| 0|\n", "| max|2016-06-19| 11621|Wednesday| 7| 9| 2016|d_999| VeteransDay| Sporting|OrthodoxEaster| Religious| 1| 1| 1|\n", "+-------+----------+------------------+---------+------------------+-----------------+------------------+-----+------------+------------+--------------+------------+-------------------+-------------------+-------------------+\n", "\n" ] } ], "source": [ "sdf.describe().show() # ugly description" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
| \n", " | summary | \n", "date | \n", "wm_yr_wk | \n", "weekday | \n", "wday | \n", "month | \n", "year | \n", "d | \n", "event_name_1 | \n", "event_type_1 | \n", "event_name_2 | \n", "event_type_2 | \n", "snap_CA | \n", "snap_TX | \n", "snap_WI | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \n", "count | \n", "1969 | \n", "1969 | \n", "1969 | \n", "1969 | \n", "1969 | \n", "1969 | \n", "1969 | \n", "162 | \n", "162 | \n", "5 | \n", "5 | \n", "1969 | \n", "1969 | \n", "1969 | \n", "
| 1 | \n", "mean | \n", "None | \n", "11347.086338242763 | \n", "None | \n", "3.9974606399187405 | \n", "6.325545962417471 | \n", "2013.288471305231 | \n", "None | \n", "None | \n", "None | \n", "None | \n", "None | \n", "0.33011681056373793 | \n", "0.33011681056373793 | \n", "0.33011681056373793 | \n", "
| 2 | \n", "stddev | \n", "None | \n", "155.2770428028507 | \n", "None | \n", "2.0011413541040746 | \n", "3.416864338775945 | \n", "1.5801982706329631 | \n", "None | \n", "None | \n", "None | \n", "None | \n", "None | \n", "0.47037439309734164 | \n", "0.47037439309734164 | \n", "0.47037439309734164 | \n", "
| 3 | \n", "min | \n", "2011-01-29 | \n", "11101 | \n", "Friday | \n", "1 | \n", "1 | \n", "2011 | \n", "d_1 | \n", "Chanukah End | \n", "Cultural | \n", "Cinco De Mayo | \n", "Cultural | \n", "0 | \n", "0 | \n", "0 | \n", "
| 4 | \n", "max | \n", "2016-06-19 | \n", "11621 | \n", "Wednesday | \n", "7 | \n", "9 | \n", "2016 | \n", "d_999 | \n", "VeteransDay | \n", "Sporting | \n", "OrthodoxEaster | \n", "Religious | \n", "1 | \n", "1 | \n", "1 | \n", "
| \n", " | month | \n", "weekday | \n", "count | \n", "
|---|---|---|---|
| 0 | \n", "3 | \n", "Monday | \n", "26 | \n", "
| 1 | \n", "11 | \n", "Saturday | \n", "22 | \n", "
| 2 | \n", "10 | \n", "Saturday | \n", "22 | \n", "
| 3 | \n", "4 | \n", "Wednesday | \n", "26 | \n", "
| 4 | \n", "1 | \n", "Monday | \n", "22 | \n", "
| ... | \n", "... | \n", "... | \n", "... | \n", "
| 79 | \n", "5 | \n", "Thursday | \n", "27 | \n", "
| 80 | \n", "2 | \n", "Wednesday | \n", "25 | \n", "
| 81 | \n", "1 | \n", "Thursday | \n", "23 | \n", "
| 82 | \n", "2 | \n", "Monday | \n", "25 | \n", "
| 83 | \n", "7 | \n", "Wednesday | \n", "23 | \n", "
84 rows × 3 columns
\n", "