{ "cells": [ { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Working with Spark DataFrame\n", "\n", "\n", "## Feng Li\n", "\n", "### Guanghua School of Management\n", "### Peking University\n", "\n", "\n", "### [feng.li@gsm.pku.edu.cn](feng.li@gsm.pku.edu.cn)\n", "### Course home page: [https://feng.li/bdcf](https://feng.li/bdcf)" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Start a Spark Session" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "import os, sys # Ensure All environment variables are properly set \n", "# os.environ[\"JAVA_HOME\"] = os.path.dirname(sys.executable)\n", "os.environ[\"PYSPARK_PYTHON\"] = sys.executable\n", "os.environ[\"PYSPARK_DRIVER_PYTHON\"] = sys.executable\n", "\n", "from pyspark.sql import SparkSession # build Spark Session\n", "spark = SparkSession.builder\\\n", " .config(\"spark.ui.enabled\", \"false\") \\\n", " .config(\"spark.executor.memory\", \"2g\")\\\n", " .config(\"spark.cores.max\", \"2\")\\\n", " .appName(\"Spark DataFrame\").getOrCreate() # using spark server" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "**Note:**\n", "\n", "- If you have trouble starting a PySpark interactive session due to a system limitation, make sure you have disabled Spark UI `.config(\"spark.ui.enabled\", \"false\")`\n", "\n", "- You could also submit your Spark Job via the `spark-submit` command in PKU HPC server (more details later).\n", "\n", " ```bash\n", " export JAVA_HOME=/nfs-share/software/anaconda/2020.02/envs/python3.12/bin\n", " export PYSPARK_PYTHON=/nfs-share/software/anaconda/2020.02/envs/python3.12/bin/python\n", " export PYSPARK_DRIVER_PYTHON=$PYSPARK_PYTHON\n", " /nfs-share/software/anaconda/2020.02/envs/python3.12/bin/spark-submit \\\n", " --conf spark.ui.enabled=false \\\n", " your-pyspark-code.py \n", " ```" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n", "| date|wm_yr_wk| weekday|wday|month|year| d| event_name_1|event_type_1|event_name_2|event_type_2|snap_CA|snap_TX|snap_WI|\n", "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n", "|2011-01-29| 11101| Saturday| 1| 1|2011| d_1| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "|2011-01-30| 11101| Sunday| 2| 1|2011| d_2| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "|2011-01-31| 11101| Monday| 3| 1|2011| d_3| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "|2011-02-01| 11101| Tuesday| 4| 2|2011| d_4| NULL| NULL| NULL| NULL| 1| 1| 0|\n", "|2011-02-02| 11101|Wednesday| 5| 2|2011| d_5| NULL| NULL| NULL| NULL| 1| 0| 1|\n", "|2011-02-03| 11101| Thursday| 6| 2|2011| d_6| NULL| NULL| NULL| NULL| 1| 1| 1|\n", "|2011-02-04| 11101| Friday| 7| 2|2011| d_7| NULL| NULL| NULL| NULL| 1| 0| 0|\n", "|2011-02-05| 11102| Saturday| 1| 2|2011| d_8| NULL| NULL| NULL| NULL| 1| 1| 1|\n", "|2011-02-06| 11102| Sunday| 2| 2|2011| d_9| SuperBowl| Sporting| NULL| NULL| 1| 1| 1|\n", "|2011-02-07| 11102| Monday| 3| 2|2011|d_10| NULL| NULL| NULL| NULL| 1| 1| 0|\n", "|2011-02-08| 11102| Tuesday| 4| 2|2011|d_11| NULL| NULL| NULL| NULL| 1| 0| 1|\n", "|2011-02-09| 11102|Wednesday| 5| 2|2011|d_12| NULL| NULL| NULL| NULL| 1| 1| 1|\n", "|2011-02-10| 11102| Thursday| 6| 2|2011|d_13| NULL| NULL| NULL| NULL| 1| 0| 0|\n", "|2011-02-11| 11102| Friday| 7| 2|2011|d_14| NULL| NULL| NULL| NULL| 0| 1| 1|\n", "|2011-02-12| 11103| Saturday| 1| 2|2011|d_15| NULL| NULL| NULL| NULL| 0| 1| 1|\n", "|2011-02-13| 11103| Sunday| 2| 2|2011|d_16| NULL| NULL| NULL| NULL| 0| 1| 0|\n", "|2011-02-14| 11103| Monday| 3| 2|2011|d_17|ValentinesDay| Cultural| NULL| NULL| 0| 0| 1|\n", "|2011-02-15| 11103| Tuesday| 4| 2|2011|d_18| NULL| NULL| NULL| NULL| 0| 1| 1|\n", "|2011-02-16| 11103|Wednesday| 5| 2|2011|d_19| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "|2011-02-17| 11103| Thursday| 6| 2|2011|d_20| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "sdf = spark.read.csv(\"../data/m5-forecasting-accuracy/calendar.csv\", header=True, inferSchema=True) # read files\n", "sdf.show() # Displays the content of the DataFrame to stdout" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Descriptive Statistics" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------+----------+------------------+---------+------------------+-----------------+------------------+-----+------------+------------+--------------+------------+-------------------+-------------------+-------------------+\n", "|summary| date| wm_yr_wk| weekday| wday| month| year| d|event_name_1|event_type_1| event_name_2|event_type_2| snap_CA| snap_TX| snap_WI|\n", "+-------+----------+------------------+---------+------------------+-----------------+------------------+-----+------------+------------+--------------+------------+-------------------+-------------------+-------------------+\n", "| count| 1969| 1969| 1969| 1969| 1969| 1969| 1969| 162| 162| 5| 5| 1969| 1969| 1969|\n", "| mean| NULL|11347.086338242763| NULL|3.9974606399187405|6.325545962417471| 2013.288471305231| NULL| NULL| NULL| NULL| NULL|0.33011681056373793|0.33011681056373793|0.33011681056373793|\n", "| stddev| NULL| 155.2770428028507| NULL|2.0011413541040746|3.416864338775945|1.5801982706329631| NULL| NULL| NULL| NULL| NULL|0.47037439309734164|0.47037439309734164|0.47037439309734164|\n", "| min|2011-01-29| 11101| Friday| 1| 1| 2011| d_1|Chanukah End| Cultural| Cinco De Mayo| Cultural| 0| 0| 0|\n", "| max|2016-06-19| 11621|Wednesday| 7| 9| 2016|d_999| VeteransDay| Sporting|OrthodoxEaster| Religious| 1| 1| 1|\n", "+-------+----------+------------------+---------+------------------+-----------------+------------------+-----+------------+------------+--------------+------------+-------------------+-------------------+-------------------+\n", "\n" ] } ], "source": [ "sdf.describe().show() # ugly description" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
summarydatewm_yr_wkweekdaywdaymonthyeardevent_name_1event_type_1event_name_2event_type_2snap_CAsnap_TXsnap_WI
0count196919691969196919691969196916216255196919691969
1meanNone11347.086338242763None3.99746063991874056.3255459624174712013.288471305231NoneNoneNoneNoneNone0.330116810563737930.330116810563737930.33011681056373793
2stddevNone155.2770428028507None2.00114135410407463.4168643387759451.5801982706329631NoneNoneNoneNoneNone0.470374393097341640.470374393097341640.47037439309734164
3min2011-01-2911101Friday112011d_1Chanukah EndCulturalCinco De MayoCultural000
4max2016-06-1911621Wednesday792016d_999VeteransDaySportingOrthodoxEasterReligious111
\n", "
" ], "text/plain": [ " summary date wm_yr_wk weekday wday \\\n", "0 count 1969 1969 1969 1969 \n", "1 mean None 11347.086338242763 None 3.9974606399187405 \n", "2 stddev None 155.2770428028507 None 2.0011413541040746 \n", "3 min 2011-01-29 11101 Friday 1 \n", "4 max 2016-06-19 11621 Wednesday 7 \n", "\n", " month year d event_name_1 event_type_1 \\\n", "0 1969 1969 1969 162 162 \n", "1 6.325545962417471 2013.288471305231 None None None \n", "2 3.416864338775945 1.5801982706329631 None None None \n", "3 1 2011 d_1 Chanukah End Cultural \n", "4 9 2016 d_999 VeteransDay Sporting \n", "\n", " event_name_2 event_type_2 snap_CA snap_TX \\\n", "0 5 5 1969 1969 \n", "1 None None 0.33011681056373793 0.33011681056373793 \n", "2 None None 0.47037439309734164 0.47037439309734164 \n", "3 Cinco De Mayo Cultural 0 0 \n", "4 OrthodoxEaster Religious 1 1 \n", "\n", " snap_WI \n", "0 1969 \n", "1 0.33011681056373793 \n", "2 0.47037439309734164 \n", "3 0 \n", "4 1 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sdf.describe().toPandas() # pretty" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------+------------------+\n", "|summary| year|\n", "+-------+------------------+\n", "| count| 1969|\n", "| mean| 2013.288471305231|\n", "| stddev|1.5801982706329631|\n", "| min| 2011|\n", "| max| 2016|\n", "+-------+------------------+\n", "\n" ] } ], "source": [ "sdf.describe(['year']).show()" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "### Print the schema in a tree format" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- date: string (nullable = true)\n", " |-- wm_yr_wk: string (nullable = true)\n", " |-- weekday: string (nullable = true)\n", " |-- wday: string (nullable = true)\n", " |-- month: string (nullable = true)\n", " |-- year: string (nullable = true)\n", " |-- d: string (nullable = true)\n", " |-- event_name_1: string (nullable = true)\n", " |-- event_type_1: string (nullable = true)\n", " |-- event_name_2: string (nullable = true)\n", " |-- event_type_2: string (nullable = true)\n", " |-- snap_CA: string (nullable = true)\n", " |-- snap_TX: string (nullable = true)\n", " |-- snap_WI: string (nullable = true)\n", "\n" ] } ], "source": [ "sdf.printSchema()" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "### Select columns" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------+----+-------+-------+-------+\n", "| date|year|snap_CA|snap_TX|snap_WI|\n", "+----------+----+-------+-------+-------+\n", "|2011-01-29|2011| 0| 0| 0|\n", "|2011-01-30|2011| 0| 0| 0|\n", "|2011-01-31|2011| 0| 0| 0|\n", "|2011-02-01|2011| 1| 1| 0|\n", "|2011-02-02|2011| 1| 0| 1|\n", "|2011-02-03|2011| 1| 1| 1|\n", "|2011-02-04|2011| 1| 0| 0|\n", "|2011-02-05|2011| 1| 1| 1|\n", "|2011-02-06|2011| 1| 1| 1|\n", "|2011-02-07|2011| 1| 1| 0|\n", "|2011-02-08|2011| 1| 0| 1|\n", "|2011-02-09|2011| 1| 1| 1|\n", "|2011-02-10|2011| 1| 0| 0|\n", "|2011-02-11|2011| 0| 1| 1|\n", "|2011-02-12|2011| 0| 1| 1|\n", "|2011-02-13|2011| 0| 1| 0|\n", "|2011-02-14|2011| 0| 0| 1|\n", "|2011-02-15|2011| 0| 1| 1|\n", "|2011-02-16|2011| 0| 0| 0|\n", "|2011-02-17|2011| 0| 0| 0|\n", "+----------+----+-------+-------+-------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "sdf.select([\"date\",\"year\",\"snap_CA\", \"snap_TX\", \"snap_WI\"]).show()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----+------------+\n", "|year|(month > 11)|\n", "+----+------------+\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "|2011| false|\n", "+----+------------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "sdf.select(sdf['year'], sdf['month']>11).show()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----+---------+-----+\n", "|month| weekday|count|\n", "+-----+---------+-----+\n", "| 3| Monday| 26|\n", "| 11| Saturday| 22|\n", "| 10| Saturday| 22|\n", "| 4|Wednesday| 26|\n", "| 1| Monday| 22|\n", "| 2| Friday| 24|\n", "| 11| Sunday| 22|\n", "| 4| Sunday| 25|\n", "| 8| Sunday| 22|\n", "| 6| Monday| 24|\n", "| 9| Monday| 22|\n", "| 4| Friday| 26|\n", "| 1| Friday| 23|\n", "| 7| Monday| 22|\n", "| 4| Tuesday| 26|\n", "| 6| Thursday| 24|\n", "| 1|Wednesday| 22|\n", "| 4| Monday| 26|\n", "| 3| Tuesday| 27|\n", "| 3| Friday| 26|\n", "+-----+---------+-----+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "# group data with respect to some columns \n", "sdf.groupBy([\"month\",\"weekday\"]).count().show() " ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
monthweekdaycount
03Monday26
111Saturday22
210Saturday22
34Wednesday26
41Monday22
............
795Thursday27
802Wednesday25
811Thursday23
822Monday25
837Wednesday23
\n", "

84 rows × 3 columns

\n", "
" ], "text/plain": [ " month weekday count\n", "0 3 Monday 26\n", "1 11 Saturday 22\n", "2 10 Saturday 22\n", "3 4 Wednesday 26\n", "4 1 Monday 22\n", ".. ... ... ...\n", "79 5 Thursday 27\n", "80 2 Wednesday 25\n", "81 1 Thursday 23\n", "82 2 Monday 25\n", "83 7 Wednesday 23\n", "\n", "[84 rows x 3 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sdf.groupBy([\"month\",\"weekday\"]).count().toPandas()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------+-----+\n", "|event_type_1|count|\n", "+------------+-----+\n", "| NULL| 1807|\n", "| Religious| 55|\n", "| National| 52|\n", "| Cultural| 37|\n", "| Sporting| 18|\n", "+------------+-----+\n", "\n" ] } ], "source": [ "## Group and sort\n", "event1count=sdf.groupBy(\"event_type_1\").count()\n", "event1count.sort(\"count\", ascending=False).show()" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "### Data cleaning" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "1969" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Returns a new DataFrame containing the distinct rows in this DataFrame.\n", "## Takes a while to compute\n", "sdf.distinct().count()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------+--------+-------+----+-----+----+------+--------------+------------+--------------+------------+-------+-------+-------+\n", "| date|wm_yr_wk|weekday|wday|month|year| d| event_name_1|event_type_1| event_name_2|event_type_2|snap_CA|snap_TX|snap_WI|\n", "+----------+--------+-------+----+-----+----+------+--------------+------------+--------------+------------+-------+-------+-------+\n", "|2011-04-24| 11113| Sunday| 2| 4|2011| d_86|OrthodoxEaster| Religious| Easter| Cultural| 0| 0| 0|\n", "|2013-05-05| 11315| Sunday| 2| 5|2013| d_828|OrthodoxEaster| Religious| Cinco De Mayo| Cultural| 1| 1| 1|\n", "|2014-04-20| 11412| Sunday| 2| 4|2014|d_1178| Easter| Cultural|OrthodoxEaster| Religious| 0| 0| 0|\n", "|2014-06-15| 11420| Sunday| 2| 6|2014|d_1234| NBAFinalsEnd| Sporting| Father's day| Cultural| 0| 1| 1|\n", "|2016-06-19| 11621| Sunday| 2| 6|2016|d_1969| NBAFinalsEnd| Sporting| Father's day| Cultural| 0| 0| 0|\n", "+----------+--------+-------+----+-----+----+------+--------------+------------+--------------+------------+-------+-------+-------+\n", "\n" ] } ], "source": [ "## Returns a new DataFrame omitting rows with null values\n", "sdf_without_na = sdf.na.drop()\n", "sdf_without_na.show()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "1969" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sdf.count() # original file size" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n", "| date|wm_yr_wk| weekday|wday|month|year| d| event_name_1|event_type_1|event_name_2|event_type_2|snap_CA|snap_TX|snap_WI|\n", "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n", "|2011-01-29| 11101| Saturday| 1| 1|2011| d_1| unknown| unknown| unknown| unknown| 0| 0| 0|\n", "|2011-01-30| 11101| Sunday| 2| 1|2011| d_2| unknown| unknown| unknown| unknown| 0| 0| 0|\n", "|2011-01-31| 11101| Monday| 3| 1|2011| d_3| unknown| unknown| unknown| unknown| 0| 0| 0|\n", "|2011-02-01| 11101| Tuesday| 4| 2|2011| d_4| unknown| unknown| unknown| unknown| 1| 1| 0|\n", "|2011-02-02| 11101|Wednesday| 5| 2|2011| d_5| unknown| unknown| unknown| unknown| 1| 0| 1|\n", "|2011-02-03| 11101| Thursday| 6| 2|2011| d_6| unknown| unknown| unknown| unknown| 1| 1| 1|\n", "|2011-02-04| 11101| Friday| 7| 2|2011| d_7| unknown| unknown| unknown| unknown| 1| 0| 0|\n", "|2011-02-05| 11102| Saturday| 1| 2|2011| d_8| unknown| unknown| unknown| unknown| 1| 1| 1|\n", "|2011-02-06| 11102| Sunday| 2| 2|2011| d_9| SuperBowl| Sporting| unknown| unknown| 1| 1| 1|\n", "|2011-02-07| 11102| Monday| 3| 2|2011|d_10| unknown| unknown| unknown| unknown| 1| 1| 0|\n", "|2011-02-08| 11102| Tuesday| 4| 2|2011|d_11| unknown| unknown| unknown| unknown| 1| 0| 1|\n", "|2011-02-09| 11102|Wednesday| 5| 2|2011|d_12| unknown| unknown| unknown| unknown| 1| 1| 1|\n", "|2011-02-10| 11102| Thursday| 6| 2|2011|d_13| unknown| unknown| unknown| unknown| 1| 0| 0|\n", "|2011-02-11| 11102| Friday| 7| 2|2011|d_14| unknown| unknown| unknown| unknown| 0| 1| 1|\n", "|2011-02-12| 11103| Saturday| 1| 2|2011|d_15| unknown| unknown| unknown| unknown| 0| 1| 1|\n", "|2011-02-13| 11103| Sunday| 2| 2|2011|d_16| unknown| unknown| unknown| unknown| 0| 1| 0|\n", "|2011-02-14| 11103| Monday| 3| 2|2011|d_17|ValentinesDay| Cultural| unknown| unknown| 0| 0| 1|\n", "|2011-02-15| 11103| Tuesday| 4| 2|2011|d_18| unknown| unknown| unknown| unknown| 0| 1| 1|\n", "|2011-02-16| 11103|Wednesday| 5| 2|2011|d_19| unknown| unknown| unknown| unknown| 0| 0| 0|\n", "|2011-02-17| 11103| Thursday| 6| 2|2011|d_20| unknown| unknown| unknown| unknown| 0| 0| 0|\n", "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "## Replace null values\n", "sdf.na.fill(\"unknown\").show()" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## User-defined functions" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [], "source": [ "from pyspark.sql.functions import udf\n", "from pyspark.sql.types import StringType\n", "\n", "# Define a simple UDF to convert strings to uppercase\n", "def to_uppercase(s):\n", " return s.upper() if s else None\n", "\n", "# Register UDF\n", "uppercase_udf = udf(to_uppercase, StringType())\n", "\n", "# Apply UDF to the 'weekday' column\n", "df_transformed = sdf.withColumn(\"weekday_upper\", uppercase_udf(sdf[\"weekday\"]))\n" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------+-------------+\n", "| weekday|weekday_upper|\n", "+---------+-------------+\n", "| Saturday| SATURDAY|\n", "| Sunday| SUNDAY|\n", "| Monday| MONDAY|\n", "| Tuesday| TUESDAY|\n", "|Wednesday| WEDNESDAY|\n", "| Thursday| THURSDAY|\n", "| Friday| FRIDAY|\n", "| Saturday| SATURDAY|\n", "| Sunday| SUNDAY|\n", "| Monday| MONDAY|\n", "| Tuesday| TUESDAY|\n", "|Wednesday| WEDNESDAY|\n", "| Thursday| THURSDAY|\n", "| Friday| FRIDAY|\n", "| Saturday| SATURDAY|\n", "| Sunday| SUNDAY|\n", "| Monday| MONDAY|\n", "| Tuesday| TUESDAY|\n", "|Wednesday| WEDNESDAY|\n", "| Thursday| THURSDAY|\n", "+---------+-------------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "df_transformed.select(\"weekday\", \"weekday_upper\").show() # Show result" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python3.12 (PySpark3.5.4)", "language": "python", "name": "pyspark" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.8" }, "rise": { "auto_select": "first", "autolaunch": false, "chalkboard": { "chalkEffect": 1, "chalkWidth": 4, "theme": "whiteboard", "transition": 800 }, "enable_chalkboard": true, "reveal_shortcuts": { "chalkboard": { "clear": "ctrl-k" } }, "start_slideshow_at": "selected", "theme": "black" } }, "nbformat": 4, "nbformat_minor": 4 }