{ "cells": [ { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Working with Spark DataFrame\n", "\n", "\n", "## Feng Li\n", "\n", "### Guanghua School of Management\n", "### Peking University\n", "\n", "\n", "### [feng.li@gsm.pku.edu.cn](feng.li@gsm.pku.edu.cn)\n", "### Course home page: [https://feng.li/bdcf](https://feng.li/bdcf)" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Start a Spark Session" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "import os, sys # Ensure All environment variables are properly set \n", "# os.environ[\"JAVA_HOME\"] = os.path.dirname(sys.executable)\n", "os.environ[\"PYSPARK_PYTHON\"] = sys.executable\n", "os.environ[\"PYSPARK_DRIVER_PYTHON\"] = sys.executable\n", "\n", "from pyspark.sql import SparkSession # build Spark Session\n", "spark = SparkSession.builder\\\n", " .config(\"spark.ui.enabled\", \"false\") \\\n", " .config(\"spark.executor.memory\", \"2g\")\\\n", " .config(\"spark.cores.max\", \"2\")\\\n", " .appName(\"Spark DataFrame\").getOrCreate() # using spark server" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "**Note:**\n", "\n", "- If you have trouble starting a PySpark interactive session due to a system limitation, make sure you have disabled Spark UI `.config(\"spark.ui.enabled\", \"false\")`\n", "\n", "- You could also submit your Spark Job via the `spark-submit` command in PKU HPC server (more details later).\n", "\n", " ```bash\n", " export JAVA_HOME=/nfs-share/software/anaconda/2020.02/envs/python3.12/bin\n", " export PYSPARK_PYTHON=/nfs-share/software/anaconda/2020.02/envs/python3.12/bin/python\n", " export PYSPARK_DRIVER_PYTHON=$PYSPARK_PYTHON\n", " /nfs-share/software/anaconda/2020.02/envs/python3.12/bin/spark-submit \\\n", " --conf spark.ui.enabled=false \\\n", " your-pyspark-code.py \n", " ```" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n", "| date|wm_yr_wk| weekday|wday|month|year| d| event_name_1|event_type_1|event_name_2|event_type_2|snap_CA|snap_TX|snap_WI|\n", "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n", "|2011-01-29| 11101| Saturday| 1| 1|2011| d_1| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "|2011-01-30| 11101| Sunday| 2| 1|2011| d_2| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "|2011-01-31| 11101| Monday| 3| 1|2011| d_3| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "|2011-02-01| 11101| Tuesday| 4| 2|2011| d_4| NULL| NULL| NULL| NULL| 1| 1| 0|\n", "|2011-02-02| 11101|Wednesday| 5| 2|2011| d_5| NULL| NULL| NULL| NULL| 1| 0| 1|\n", "|2011-02-03| 11101| Thursday| 6| 2|2011| d_6| NULL| NULL| NULL| NULL| 1| 1| 1|\n", "|2011-02-04| 11101| Friday| 7| 2|2011| d_7| NULL| NULL| NULL| NULL| 1| 0| 0|\n", "|2011-02-05| 11102| Saturday| 1| 2|2011| d_8| NULL| NULL| NULL| NULL| 1| 1| 1|\n", "|2011-02-06| 11102| Sunday| 2| 2|2011| d_9| SuperBowl| Sporting| NULL| NULL| 1| 1| 1|\n", "|2011-02-07| 11102| Monday| 3| 2|2011|d_10| NULL| NULL| NULL| NULL| 1| 1| 0|\n", "|2011-02-08| 11102| Tuesday| 4| 2|2011|d_11| NULL| NULL| NULL| NULL| 1| 0| 1|\n", "|2011-02-09| 11102|Wednesday| 5| 2|2011|d_12| NULL| NULL| NULL| NULL| 1| 1| 1|\n", "|2011-02-10| 11102| Thursday| 6| 2|2011|d_13| NULL| NULL| NULL| NULL| 1| 0| 0|\n", "|2011-02-11| 11102| Friday| 7| 2|2011|d_14| NULL| NULL| NULL| NULL| 0| 1| 1|\n", "|2011-02-12| 11103| Saturday| 1| 2|2011|d_15| NULL| NULL| NULL| NULL| 0| 1| 1|\n", "|2011-02-13| 11103| Sunday| 2| 2|2011|d_16| NULL| NULL| NULL| NULL| 0| 1| 0|\n", "|2011-02-14| 11103| Monday| 3| 2|2011|d_17|ValentinesDay| Cultural| NULL| NULL| 0| 0| 1|\n", "|2011-02-15| 11103| Tuesday| 4| 2|2011|d_18| NULL| NULL| NULL| NULL| 0| 1| 1|\n", "|2011-02-16| 11103|Wednesday| 5| 2|2011|d_19| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "|2011-02-17| 11103| Thursday| 6| 2|2011|d_20| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "sdf = spark.read.csv(\"../data/m5-forecasting-accuracy/calendar.csv\", header=True, inferSchema=True) # read files\n", "sdf.show() # Displays the content of the DataFrame to stdout" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Descriptive Statistics" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------+----------+------------------+---------+------------------+-----------------+------------------+-----+------------+------------+--------------+------------+-------------------+-------------------+-------------------+\n", "|summary| date| wm_yr_wk| weekday| wday| month| year| d|event_name_1|event_type_1| event_name_2|event_type_2| snap_CA| snap_TX| snap_WI|\n", "+-------+----------+------------------+---------+------------------+-----------------+------------------+-----+------------+------------+--------------+------------+-------------------+-------------------+-------------------+\n", "| count| 1969| 1969| 1969| 1969| 1969| 1969| 1969| 162| 162| 5| 5| 1969| 1969| 1969|\n", "| mean| NULL|11347.086338242763| NULL|3.9974606399187405|6.325545962417471| 2013.288471305231| NULL| NULL| NULL| NULL| NULL|0.33011681056373793|0.33011681056373793|0.33011681056373793|\n", "| stddev| NULL| 155.2770428028507| NULL|2.0011413541040746|3.416864338775945|1.5801982706329631| NULL| NULL| NULL| NULL| NULL|0.47037439309734164|0.47037439309734164|0.47037439309734164|\n", "| min|2011-01-29| 11101| Friday| 1| 1| 2011| d_1|Chanukah End| Cultural| Cinco De Mayo| Cultural| 0| 0| 0|\n", "| max|2016-06-19| 11621|Wednesday| 7| 9| 2016|d_999| VeteransDay| Sporting|OrthodoxEaster| Religious| 1| 1| 1|\n", "+-------+----------+------------------+---------+------------------+-----------------+------------------+-----+------------+------------+--------------+------------+-------------------+-------------------+-------------------+\n", "\n" ] } ], "source": [ "sdf.describe().show() # ugly description" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>summary</th>\n", " <th>date</th>\n", " <th>wm_yr_wk</th>\n", " <th>weekday</th>\n", " <th>wday</th>\n", " <th>month</th>\n", " <th>year</th>\n", " <th>d</th>\n", " <th>event_name_1</th>\n", " <th>event_type_1</th>\n", " <th>event_name_2</th>\n", " <th>event_type_2</th>\n", " <th>snap_CA</th>\n", " <th>snap_TX</th>\n", " <th>snap_WI</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>count</td>\n", " <td>1969</td>\n", " <td>1969</td>\n", " <td>1969</td>\n", " <td>1969</td>\n", " <td>1969</td>\n", " <td>1969</td>\n", " <td>1969</td>\n", " <td>162</td>\n", " <td>162</td>\n", " <td>5</td>\n", " <td>5</td>\n", " <td>1969</td>\n", " <td>1969</td>\n", " <td>1969</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>mean</td>\n", " <td>None</td>\n", " <td>11347.086338242763</td>\n", " <td>None</td>\n", " <td>3.9974606399187405</td>\n", " <td>6.325545962417471</td>\n", " <td>2013.288471305231</td>\n", " <td>None</td>\n", " <td>None</td>\n", " <td>None</td>\n", " <td>None</td>\n", " <td>None</td>\n", " <td>0.33011681056373793</td>\n", " <td>0.33011681056373793</td>\n", " <td>0.33011681056373793</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>stddev</td>\n", " <td>None</td>\n", " <td>155.2770428028507</td>\n", " <td>None</td>\n", " <td>2.0011413541040746</td>\n", " <td>3.416864338775945</td>\n", " <td>1.5801982706329631</td>\n", " <td>None</td>\n", " <td>None</td>\n", " <td>None</td>\n", " <td>None</td>\n", " <td>None</td>\n", " <td>0.47037439309734164</td>\n", " <td>0.47037439309734164</td>\n", " <td>0.47037439309734164</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>min</td>\n", " <td>2011-01-29</td>\n", " <td>11101</td>\n", " <td>Friday</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>2011</td>\n", " <td>d_1</td>\n", " <td>Chanukah End</td>\n", " <td>Cultural</td>\n", " <td>Cinco De Mayo</td>\n", " <td>Cultural</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>max</td>\n", " <td>2016-06-19</td>\n", " <td>11621</td>\n", " <td>Wednesday</td>\n", " <td>7</td>\n", " <td>9</td>\n", " <td>2016</td>\n", " <td>d_999</td>\n", " <td>VeteransDay</td>\n", " <td>Sporting</td>\n", " <td>OrthodoxEaster</td>\n", " <td>Religious</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " summary date wm_yr_wk weekday wday \\\n", "0 count 1969 1969 1969 1969 \n", "1 mean None 11347.086338242763 None 3.9974606399187405 \n", "2 stddev None 155.2770428028507 None 2.0011413541040746 \n", "3 min 2011-01-29 11101 Friday 1 \n", "4 max 2016-06-19 11621 Wednesday 7 \n", "\n", " month year d event_name_1 event_type_1 \\\n", "0 1969 1969 1969 162 162 \n", "1 6.325545962417471 2013.288471305231 None None None \n", "2 3.416864338775945 1.5801982706329631 None None None \n", "3 1 2011 d_1 Chanukah End Cultural \n", "4 9 2016 d_999 VeteransDay Sporting \n", "\n", " event_name_2 event_type_2 snap_CA snap_TX \\\n", "0 5 5 1969 1969 \n", "1 None None 0.33011681056373793 0.33011681056373793 \n", "2 None None 0.47037439309734164 0.47037439309734164 \n", "3 Cinco De Mayo Cultural 0 0 \n", "4 OrthodoxEaster Religious 1 1 \n", "\n", " snap_WI \n", "0 1969 \n", "1 0.33011681056373793 \n", "2 0.47037439309734164 \n", "3 0 \n", "4 1 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sdf.describe().toPandas() # pretty" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------+------------------+\n", "|summary| year|\n", "+-------+------------------+\n", "| count| 1969|\n", "| mean| 2013.288471305231|\n", "| stddev|1.5801982706329631|\n", "| min| 2011|\n", "| max| 2016|\n", "+-------+------------------+\n", "\n" ] } ], "source": [ "sdf.describe(['year']).show()" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "### Print the schema in a tree format" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- date: string (nullable = true)\n", " |-- wm_yr_wk: string (nullable = true)\n", " |-- weekday: string (nullable = true)\n", " |-- wday: string (nullable = true)\n", " |-- month: string (nullable = true)\n", " |-- year: string (nullable = true)\n", " |-- d: string (nullable = true)\n", " |-- event_name_1: string (nullable = true)\n", " |-- event_type_1: string (nullable = true)\n", " |-- event_name_2: string (nullable = true)\n", " |-- event_type_2: string (nullable = true)\n", " |-- snap_CA: string (nullable = true)\n", " |-- snap_TX: string (nullable = true)\n", " |-- snap_WI: string (nullable = true)\n", "\n" ] } ], "source": [ "sdf.printSchema()" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "### Select columns" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------+----+-------+-------+-------+\n", "| date|year|snap_CA|snap_TX|snap_WI|\n", "+----------+----+-------+-------+-------+\n", "|2011-01-29|2011| 0| 0| 0|\n", "|2011-01-30|2011| 0| 0| 0|\n", "|2011-01-31|2011| 0| 0| 0|\n", "|2011-02-01|2011| 1| 1| 0|\n", "|2011-02-02|2011| 1| 0| 1|\n", "|2011-02-03|2011| 1| 1| 1|\n", "|2011-02-04|2011| 1| 0| 0|\n", "|2011-02-05|2011| 1| 1| 1|\n", "|2011-02-06|2011| 1| 1| 1|\n", "|2011-02-07|2011| 1| 1| 0|\n", "|2011-02-08|2011| 1| 0| 1|\n", "|2011-02-09|2011| 1| 1| 1|\n", "|2011-02-10|2011| 1| 0| 0|\n", "|2011-02-11|2011| 0| 1| 1|\n", "|2011-02-12|2011| 0| 1| 1|\n", "|2011-02-13|2011| 0| 1| 0|\n", "|2011-02-14|2011| 0| 0| 1|\n", "|2011-02-15|2011| 0| 1| 1|\n", "|2011-02-16|2011| 0| 0| 0|\n", "|2011-02-17|2011| 0| 0| 0|\n", "+----------+----+-------+-------+-------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "sdf.select([\"date\",\"year\",\"snap_CA\", \"snap_TX\", \"snap_WI\"]).show()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----+------------+\n", "|year|(month > 11)|\n", "+----+------------+\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "+----+------------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "sdf.select(sdf['year'], sdf['month']>11).show()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----+---------+-----+\n", "|month| weekday|count|\n", "+-----+---------+-----+\n", "| 3| Monday| 26|\n", "| 11| Saturday| 22|\n", "| 10| Saturday| 22|\n", "| 4|Wednesday| 26|\n", "| 1| Monday| 22|\n", "| 2| Friday| 24|\n", "| 11| Sunday| 22|\n", "| 4| Sunday| 25|\n", "| 8| Sunday| 22|\n", "| 6| Monday| 24|\n", "| 9| Monday| 22|\n", "| 4| Friday| 26|\n", "| 1| Friday| 23|\n", "| 7| Monday| 22|\n", "| 4| Tuesday| 26|\n", "| 6| Thursday| 24|\n", "| 1|Wednesday| 22|\n", "| 4| Monday| 26|\n", "| 3| Tuesday| 27|\n", "| 3| Friday| 26|\n", "+-----+---------+-----+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "# group data with respect to some columns \n", "sdf.groupBy([\"month\",\"weekday\"]).count().show() " ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>month</th>\n", " <th>weekday</th>\n", " <th>count</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>3</td>\n", " <td>Monday</td>\n", " <td>26</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>11</td>\n", " <td>Saturday</td>\n", " <td>22</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>10</td>\n", " <td>Saturday</td>\n", " <td>22</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>4</td>\n", " <td>Wednesday</td>\n", " <td>26</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>1</td>\n", " <td>Monday</td>\n", " <td>22</td>\n", " </tr>\n", " <tr>\n", " <th>...</th>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " </tr>\n", " <tr>\n", " <th>79</th>\n", " <td>5</td>\n", " <td>Thursday</td>\n", " <td>27</td>\n", " </tr>\n", " <tr>\n", " <th>80</th>\n", " <td>2</td>\n", " <td>Wednesday</td>\n", " <td>25</td>\n", " </tr>\n", " <tr>\n", " <th>81</th>\n", " <td>1</td>\n", " <td>Thursday</td>\n", " <td>23</td>\n", " </tr>\n", " <tr>\n", " <th>82</th>\n", " <td>2</td>\n", " <td>Monday</td>\n", " <td>25</td>\n", " </tr>\n", " <tr>\n", " <th>83</th>\n", " <td>7</td>\n", " <td>Wednesday</td>\n", " <td>23</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>84 rows × 3 columns</p>\n", "</div>" ], "text/plain": [ " month weekday count\n", "0 3 Monday 26\n", "1 11 Saturday 22\n", "2 10 Saturday 22\n", "3 4 Wednesday 26\n", "4 1 Monday 22\n", ".. ... ... ...\n", "79 5 Thursday 27\n", "80 2 Wednesday 25\n", "81 1 Thursday 23\n", "82 2 Monday 25\n", "83 7 Wednesday 23\n", "\n", "[84 rows x 3 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sdf.groupBy([\"month\",\"weekday\"]).count().toPandas()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------+-----+\n", "|event_type_1|count|\n", "+------------+-----+\n", "| NULL| 1807|\n", "| Religious| 55|\n", "| National| 52|\n", "| Cultural| 37|\n", "| Sporting| 18|\n", "+------------+-----+\n", "\n" ] } ], "source": [ "## Group and sort\n", "event1count=sdf.groupBy(\"event_type_1\").count()\n", "event1count.sort(\"count\", ascending=False).show()" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "### Data cleaning" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "1969" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Returns a new DataFrame containing the distinct rows in this DataFrame.\n", "## Takes a while to compute\n", "sdf.distinct().count()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------+--------+-------+----+-----+----+------+--------------+------------+--------------+------------+-------+-------+-------+\n", "| date|wm_yr_wk|weekday|wday|month|year| d| event_name_1|event_type_1| event_name_2|event_type_2|snap_CA|snap_TX|snap_WI|\n", "+----------+--------+-------+----+-----+----+------+--------------+------------+--------------+------------+-------+-------+-------+\n", "|2011-04-24| 11113| Sunday| 2| 4|2011| d_86|OrthodoxEaster| Religious| Easter| Cultural| 0| 0| 0|\n", "|2013-05-05| 11315| Sunday| 2| 5|2013| d_828|OrthodoxEaster| Religious| Cinco De Mayo| Cultural| 1| 1| 1|\n", "|2014-04-20| 11412| Sunday| 2| 4|2014|d_1178| Easter| Cultural|OrthodoxEaster| Religious| 0| 0| 0|\n", "|2014-06-15| 11420| Sunday| 2| 6|2014|d_1234| NBAFinalsEnd| Sporting| Father's day| Cultural| 0| 1| 1|\n", "|2016-06-19| 11621| Sunday| 2| 6|2016|d_1969| NBAFinalsEnd| Sporting| Father's day| Cultural| 0| 0| 0|\n", "+----------+--------+-------+----+-----+----+------+--------------+------------+--------------+------------+-------+-------+-------+\n", "\n" ] } ], "source": [ "## Returns a new DataFrame omitting rows with null values\n", "sdf_without_na = sdf.na.drop()\n", "sdf_without_na.show()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "1969" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sdf.count() # original file size" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n", "| date|wm_yr_wk| weekday|wday|month|year| d| event_name_1|event_type_1|event_name_2|event_type_2|snap_CA|snap_TX|snap_WI|\n", "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n", "|2011-01-29| 11101| Saturday| 1| 1|2011| d_1| unknown| unknown| unknown| unknown| 0| 0| 0|\n", "|2011-01-30| 11101| Sunday| 2| 1|2011| d_2| unknown| unknown| unknown| unknown| 0| 0| 0|\n", "|2011-01-31| 11101| Monday| 3| 1|2011| d_3| unknown| unknown| unknown| unknown| 0| 0| 0|\n", "|2011-02-01| 11101| Tuesday| 4| 2|2011| d_4| unknown| unknown| unknown| unknown| 1| 1| 0|\n", "|2011-02-02| 11101|Wednesday| 5| 2|2011| d_5| unknown| unknown| unknown| unknown| 1| 0| 1|\n", "|2011-02-03| 11101| Thursday| 6| 2|2011| d_6| unknown| unknown| unknown| unknown| 1| 1| 1|\n", "|2011-02-04| 11101| Friday| 7| 2|2011| d_7| unknown| unknown| unknown| unknown| 1| 0| 0|\n", "|2011-02-05| 11102| Saturday| 1| 2|2011| d_8| unknown| unknown| unknown| unknown| 1| 1| 1|\n", "|2011-02-06| 11102| Sunday| 2| 2|2011| d_9| SuperBowl| Sporting| unknown| unknown| 1| 1| 1|\n", "|2011-02-07| 11102| Monday| 3| 2|2011|d_10| unknown| unknown| unknown| unknown| 1| 1| 0|\n", "|2011-02-08| 11102| Tuesday| 4| 2|2011|d_11| unknown| unknown| unknown| unknown| 1| 0| 1|\n", "|2011-02-09| 11102|Wednesday| 5| 2|2011|d_12| unknown| unknown| unknown| unknown| 1| 1| 1|\n", "|2011-02-10| 11102| Thursday| 6| 2|2011|d_13| unknown| unknown| unknown| unknown| 1| 0| 0|\n", "|2011-02-11| 11102| Friday| 7| 2|2011|d_14| unknown| unknown| unknown| unknown| 0| 1| 1|\n", "|2011-02-12| 11103| Saturday| 1| 2|2011|d_15| unknown| unknown| unknown| unknown| 0| 1| 1|\n", "|2011-02-13| 11103| Sunday| 2| 2|2011|d_16| unknown| unknown| unknown| unknown| 0| 1| 0|\n", "|2011-02-14| 11103| Monday| 3| 2|2011|d_17|ValentinesDay| Cultural| unknown| unknown| 0| 0| 1|\n", "|2011-02-15| 11103| Tuesday| 4| 2|2011|d_18| unknown| unknown| unknown| unknown| 0| 1| 1|\n", "|2011-02-16| 11103|Wednesday| 5| 2|2011|d_19| unknown| unknown| unknown| unknown| 0| 0| 0|\n", "|2011-02-17| 11103| Thursday| 6| 2|2011|d_20| unknown| unknown| unknown| unknown| 0| 0| 0|\n", "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "## Replace null values\n", "sdf.na.fill(\"unknown\").show()" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## User-defined functions" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [], "source": [ "from pyspark.sql.functions import udf\n", "from pyspark.sql.types import StringType\n", "\n", "# Define a simple UDF to convert strings to uppercase\n", "def to_uppercase(s):\n", " return s.upper() if s else None\n", "\n", "# Register UDF\n", "uppercase_udf = udf(to_uppercase, StringType())\n", "\n", "# Apply UDF to the 'weekday' column\n", "df_transformed = sdf.withColumn(\"weekday_upper\", uppercase_udf(sdf[\"weekday\"]))\n" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------+-------------+\n", "| weekday|weekday_upper|\n", "+---------+-------------+\n", "| Saturday| SATURDAY|\n", "| Sunday| SUNDAY|\n", "| Monday| MONDAY|\n", "| Tuesday| TUESDAY|\n", "|Wednesday| WEDNESDAY|\n", "| Thursday| THURSDAY|\n", "| Friday| FRIDAY|\n", "| Saturday| SATURDAY|\n", "| Sunday| SUNDAY|\n", "| Monday| MONDAY|\n", "| Tuesday| TUESDAY|\n", "|Wednesday| WEDNESDAY|\n", "| Thursday| THURSDAY|\n", "| Friday| FRIDAY|\n", "| Saturday| SATURDAY|\n", "| Sunday| SUNDAY|\n", "| Monday| MONDAY|\n", "| Tuesday| TUESDAY|\n", "|Wednesday| WEDNESDAY|\n", "| Thursday| THURSDAY|\n", "+---------+-------------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "df_transformed.select(\"weekday\", \"weekday_upper\").show() # Show result" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python3.12 (PySpark3.5.4)", "language": "python", "name": "pyspark" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.8" }, "rise": { "auto_select": "first", "autolaunch": false, "chalkboard": { "chalkEffect": 1, "chalkWidth": 4, "theme": "whiteboard", "transition": 800 }, "enable_chalkboard": true, "reveal_shortcuts": { "chalkboard": { "clear": "ctrl-k" } }, "start_slideshow_at": "selected", "theme": "black" } }, "nbformat": 4, "nbformat_minor": 4 }