{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "# Working with Spark DataFrame\n",
    "\n",
    "\n",
    "## Feng Li\n",
    "\n",
    "### Guanghua School of Management\n",
    "### Peking University\n",
    "\n",
    "\n",
    "### [feng.li@gsm.pku.edu.cn](feng.li@gsm.pku.edu.cn)\n",
    "### Course home page: [https://feng.li/bdcf](https://feng.li/bdcf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## Start a Spark Session"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "fragment"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import os, sys # Ensure All environment variables are properly set \n",
    "# os.environ[\"JAVA_HOME\"] = os.path.dirname(sys.executable)\n",
    "os.environ[\"PYSPARK_PYTHON\"] = sys.executable\n",
    "os.environ[\"PYSPARK_DRIVER_PYTHON\"] = sys.executable\n",
    "\n",
    "from pyspark.sql import SparkSession # build Spark Session\n",
    "spark = SparkSession.builder\\\n",
    "        .config(\"spark.ui.enabled\", \"false\")  \\\n",
    "        .config(\"spark.executor.memory\", \"2g\")\\\n",
    "        .config(\"spark.cores.max\", \"2\")\\\n",
    "        .appName(\"Spark DataFrame\").getOrCreate() # using spark server"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "**Note:**\n",
    "\n",
    "- If you have trouble starting a PySpark interactive session due to a system limitation, make sure you have disabled Spark UI `.config(\"spark.ui.enabled\", \"false\")`\n",
    "\n",
    "- You could also submit your Spark Job via the `spark-submit` command in PKU HPC server (more details later).\n",
    "\n",
    "    ```bash\n",
    "    export JAVA_HOME=/nfs-share/software/anaconda/2020.02/envs/python3.12/bin\n",
    "    export PYSPARK_PYTHON=/nfs-share/software/anaconda/2020.02/envs/python3.12/bin/python\n",
    "    export PYSPARK_DRIVER_PYTHON=$PYSPARK_PYTHON\n",
    "    /nfs-share/software/anaconda/2020.02/envs/python3.12/bin/spark-submit \\\n",
    "        --conf spark.ui.enabled=false     \\\n",
    "        your-pyspark-code.py         \n",
    "    ```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n",
      "|      date|wm_yr_wk|  weekday|wday|month|year|   d| event_name_1|event_type_1|event_name_2|event_type_2|snap_CA|snap_TX|snap_WI|\n",
      "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n",
      "|2011-01-29|   11101| Saturday|   1|    1|2011| d_1|         NULL|        NULL|        NULL|        NULL|      0|      0|      0|\n",
      "|2011-01-30|   11101|   Sunday|   2|    1|2011| d_2|         NULL|        NULL|        NULL|        NULL|      0|      0|      0|\n",
      "|2011-01-31|   11101|   Monday|   3|    1|2011| d_3|         NULL|        NULL|        NULL|        NULL|      0|      0|      0|\n",
      "|2011-02-01|   11101|  Tuesday|   4|    2|2011| d_4|         NULL|        NULL|        NULL|        NULL|      1|      1|      0|\n",
      "|2011-02-02|   11101|Wednesday|   5|    2|2011| d_5|         NULL|        NULL|        NULL|        NULL|      1|      0|      1|\n",
      "|2011-02-03|   11101| Thursday|   6|    2|2011| d_6|         NULL|        NULL|        NULL|        NULL|      1|      1|      1|\n",
      "|2011-02-04|   11101|   Friday|   7|    2|2011| d_7|         NULL|        NULL|        NULL|        NULL|      1|      0|      0|\n",
      "|2011-02-05|   11102| Saturday|   1|    2|2011| d_8|         NULL|        NULL|        NULL|        NULL|      1|      1|      1|\n",
      "|2011-02-06|   11102|   Sunday|   2|    2|2011| d_9|    SuperBowl|    Sporting|        NULL|        NULL|      1|      1|      1|\n",
      "|2011-02-07|   11102|   Monday|   3|    2|2011|d_10|         NULL|        NULL|        NULL|        NULL|      1|      1|      0|\n",
      "|2011-02-08|   11102|  Tuesday|   4|    2|2011|d_11|         NULL|        NULL|        NULL|        NULL|      1|      0|      1|\n",
      "|2011-02-09|   11102|Wednesday|   5|    2|2011|d_12|         NULL|        NULL|        NULL|        NULL|      1|      1|      1|\n",
      "|2011-02-10|   11102| Thursday|   6|    2|2011|d_13|         NULL|        NULL|        NULL|        NULL|      1|      0|      0|\n",
      "|2011-02-11|   11102|   Friday|   7|    2|2011|d_14|         NULL|        NULL|        NULL|        NULL|      0|      1|      1|\n",
      "|2011-02-12|   11103| Saturday|   1|    2|2011|d_15|         NULL|        NULL|        NULL|        NULL|      0|      1|      1|\n",
      "|2011-02-13|   11103|   Sunday|   2|    2|2011|d_16|         NULL|        NULL|        NULL|        NULL|      0|      1|      0|\n",
      "|2011-02-14|   11103|   Monday|   3|    2|2011|d_17|ValentinesDay|    Cultural|        NULL|        NULL|      0|      0|      1|\n",
      "|2011-02-15|   11103|  Tuesday|   4|    2|2011|d_18|         NULL|        NULL|        NULL|        NULL|      0|      1|      1|\n",
      "|2011-02-16|   11103|Wednesday|   5|    2|2011|d_19|         NULL|        NULL|        NULL|        NULL|      0|      0|      0|\n",
      "|2011-02-17|   11103| Thursday|   6|    2|2011|d_20|         NULL|        NULL|        NULL|        NULL|      0|      0|      0|\n",
      "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "sdf = spark.read.csv(\"../data/m5-forecasting-accuracy/calendar.csv\", header=True, inferSchema=True) # read files\n",
    "sdf.show() # Displays the content of the DataFrame to stdout"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## Descriptive  Statistics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+----------+------------------+---------+------------------+-----------------+------------------+-----+------------+------------+--------------+------------+-------------------+-------------------+-------------------+\n",
      "|summary|      date|          wm_yr_wk|  weekday|              wday|            month|              year|    d|event_name_1|event_type_1|  event_name_2|event_type_2|            snap_CA|            snap_TX|            snap_WI|\n",
      "+-------+----------+------------------+---------+------------------+-----------------+------------------+-----+------------+------------+--------------+------------+-------------------+-------------------+-------------------+\n",
      "|  count|      1969|              1969|     1969|              1969|             1969|              1969| 1969|         162|         162|             5|           5|               1969|               1969|               1969|\n",
      "|   mean|      NULL|11347.086338242763|     NULL|3.9974606399187405|6.325545962417471| 2013.288471305231| NULL|        NULL|        NULL|          NULL|        NULL|0.33011681056373793|0.33011681056373793|0.33011681056373793|\n",
      "| stddev|      NULL| 155.2770428028507|     NULL|2.0011413541040746|3.416864338775945|1.5801982706329631| NULL|        NULL|        NULL|          NULL|        NULL|0.47037439309734164|0.47037439309734164|0.47037439309734164|\n",
      "|    min|2011-01-29|             11101|   Friday|                 1|                1|              2011|  d_1|Chanukah End|    Cultural| Cinco De Mayo|    Cultural|                  0|                  0|                  0|\n",
      "|    max|2016-06-19|             11621|Wednesday|                 7|                9|              2016|d_999| VeteransDay|    Sporting|OrthodoxEaster|   Religious|                  1|                  1|                  1|\n",
      "+-------+----------+------------------+---------+------------------+-----------------+------------------+-----+------------+------------+--------------+------------+-------------------+-------------------+-------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "sdf.describe().show() # ugly description"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>summary</th>\n",
       "      <th>date</th>\n",
       "      <th>wm_yr_wk</th>\n",
       "      <th>weekday</th>\n",
       "      <th>wday</th>\n",
       "      <th>month</th>\n",
       "      <th>year</th>\n",
       "      <th>d</th>\n",
       "      <th>event_name_1</th>\n",
       "      <th>event_type_1</th>\n",
       "      <th>event_name_2</th>\n",
       "      <th>event_type_2</th>\n",
       "      <th>snap_CA</th>\n",
       "      <th>snap_TX</th>\n",
       "      <th>snap_WI</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>count</td>\n",
       "      <td>1969</td>\n",
       "      <td>1969</td>\n",
       "      <td>1969</td>\n",
       "      <td>1969</td>\n",
       "      <td>1969</td>\n",
       "      <td>1969</td>\n",
       "      <td>1969</td>\n",
       "      <td>162</td>\n",
       "      <td>162</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>1969</td>\n",
       "      <td>1969</td>\n",
       "      <td>1969</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>mean</td>\n",
       "      <td>None</td>\n",
       "      <td>11347.086338242763</td>\n",
       "      <td>None</td>\n",
       "      <td>3.9974606399187405</td>\n",
       "      <td>6.325545962417471</td>\n",
       "      <td>2013.288471305231</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>0.33011681056373793</td>\n",
       "      <td>0.33011681056373793</td>\n",
       "      <td>0.33011681056373793</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>stddev</td>\n",
       "      <td>None</td>\n",
       "      <td>155.2770428028507</td>\n",
       "      <td>None</td>\n",
       "      <td>2.0011413541040746</td>\n",
       "      <td>3.416864338775945</td>\n",
       "      <td>1.5801982706329631</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>0.47037439309734164</td>\n",
       "      <td>0.47037439309734164</td>\n",
       "      <td>0.47037439309734164</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>min</td>\n",
       "      <td>2011-01-29</td>\n",
       "      <td>11101</td>\n",
       "      <td>Friday</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2011</td>\n",
       "      <td>d_1</td>\n",
       "      <td>Chanukah End</td>\n",
       "      <td>Cultural</td>\n",
       "      <td>Cinco De Mayo</td>\n",
       "      <td>Cultural</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>max</td>\n",
       "      <td>2016-06-19</td>\n",
       "      <td>11621</td>\n",
       "      <td>Wednesday</td>\n",
       "      <td>7</td>\n",
       "      <td>9</td>\n",
       "      <td>2016</td>\n",
       "      <td>d_999</td>\n",
       "      <td>VeteransDay</td>\n",
       "      <td>Sporting</td>\n",
       "      <td>OrthodoxEaster</td>\n",
       "      <td>Religious</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  summary        date            wm_yr_wk    weekday                wday  \\\n",
       "0   count        1969                1969       1969                1969   \n",
       "1    mean        None  11347.086338242763       None  3.9974606399187405   \n",
       "2  stddev        None   155.2770428028507       None  2.0011413541040746   \n",
       "3     min  2011-01-29               11101     Friday                   1   \n",
       "4     max  2016-06-19               11621  Wednesday                   7   \n",
       "\n",
       "               month                year      d  event_name_1 event_type_1  \\\n",
       "0               1969                1969   1969           162          162   \n",
       "1  6.325545962417471   2013.288471305231   None          None         None   \n",
       "2  3.416864338775945  1.5801982706329631   None          None         None   \n",
       "3                  1                2011    d_1  Chanukah End     Cultural   \n",
       "4                  9                2016  d_999   VeteransDay     Sporting   \n",
       "\n",
       "     event_name_2 event_type_2              snap_CA              snap_TX  \\\n",
       "0               5            5                 1969                 1969   \n",
       "1            None         None  0.33011681056373793  0.33011681056373793   \n",
       "2            None         None  0.47037439309734164  0.47037439309734164   \n",
       "3   Cinco De Mayo     Cultural                    0                    0   \n",
       "4  OrthodoxEaster    Religious                    1                    1   \n",
       "\n",
       "               snap_WI  \n",
       "0                 1969  \n",
       "1  0.33011681056373793  \n",
       "2  0.47037439309734164  \n",
       "3                    0  \n",
       "4                    1  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sdf.describe().toPandas() # pretty"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+------------------+\n",
      "|summary|              year|\n",
      "+-------+------------------+\n",
      "|  count|              1969|\n",
      "|   mean| 2013.288471305231|\n",
      "| stddev|1.5801982706329631|\n",
      "|    min|              2011|\n",
      "|    max|              2016|\n",
      "+-------+------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "sdf.describe(['year']).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "### Print the schema in a tree format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "fragment"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- date: string (nullable = true)\n",
      " |-- wm_yr_wk: string (nullable = true)\n",
      " |-- weekday: string (nullable = true)\n",
      " |-- wday: string (nullable = true)\n",
      " |-- month: string (nullable = true)\n",
      " |-- year: string (nullable = true)\n",
      " |-- d: string (nullable = true)\n",
      " |-- event_name_1: string (nullable = true)\n",
      " |-- event_type_1: string (nullable = true)\n",
      " |-- event_name_2: string (nullable = true)\n",
      " |-- event_type_2: string (nullable = true)\n",
      " |-- snap_CA: string (nullable = true)\n",
      " |-- snap_TX: string (nullable = true)\n",
      " |-- snap_WI: string (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "sdf.printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "### Select columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+----------+----+-------+-------+-------+\n",
      "|      date|year|snap_CA|snap_TX|snap_WI|\n",
      "+----------+----+-------+-------+-------+\n",
      "|2011-01-29|2011|      0|      0|      0|\n",
      "|2011-01-30|2011|      0|      0|      0|\n",
      "|2011-01-31|2011|      0|      0|      0|\n",
      "|2011-02-01|2011|      1|      1|      0|\n",
      "|2011-02-02|2011|      1|      0|      1|\n",
      "|2011-02-03|2011|      1|      1|      1|\n",
      "|2011-02-04|2011|      1|      0|      0|\n",
      "|2011-02-05|2011|      1|      1|      1|\n",
      "|2011-02-06|2011|      1|      1|      1|\n",
      "|2011-02-07|2011|      1|      1|      0|\n",
      "|2011-02-08|2011|      1|      0|      1|\n",
      "|2011-02-09|2011|      1|      1|      1|\n",
      "|2011-02-10|2011|      1|      0|      0|\n",
      "|2011-02-11|2011|      0|      1|      1|\n",
      "|2011-02-12|2011|      0|      1|      1|\n",
      "|2011-02-13|2011|      0|      1|      0|\n",
      "|2011-02-14|2011|      0|      0|      1|\n",
      "|2011-02-15|2011|      0|      1|      1|\n",
      "|2011-02-16|2011|      0|      0|      0|\n",
      "|2011-02-17|2011|      0|      0|      0|\n",
      "+----------+----+-------+-------+-------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "sdf.select([\"date\",\"year\",\"snap_CA\", \"snap_TX\", \"snap_WI\"]).show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+----+------------+\n",
      "|year|(month > 11)|\n",
      "+----+------------+\n",
      "|2011|       false|\n",
      "|2011|       false|\n",
      "|2011|       false|\n",
      "|2011|       false|\n",
      "|2011|       false|\n",
      "|2011|       false|\n",
      "|2011|       false|\n",
      "|2011|       false|\n",
      "|2011|       false|\n",
      "|2011|       false|\n",
      "|2011|       false|\n",
      "|2011|       false|\n",
      "|2011|       false|\n",
      "|2011|       false|\n",
      "|2011|       false|\n",
      "|2011|       false|\n",
      "|2011|       false|\n",
      "|2011|       false|\n",
      "|2011|       false|\n",
      "|2011|       false|\n",
      "+----+------------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "sdf.select(sdf['year'], sdf['month']>11).show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-----+---------+-----+\n",
      "|month|  weekday|count|\n",
      "+-----+---------+-----+\n",
      "|    3|   Monday|   26|\n",
      "|   11| Saturday|   22|\n",
      "|   10| Saturday|   22|\n",
      "|    4|Wednesday|   26|\n",
      "|    1|   Monday|   22|\n",
      "|    2|   Friday|   24|\n",
      "|   11|   Sunday|   22|\n",
      "|    4|   Sunday|   25|\n",
      "|    8|   Sunday|   22|\n",
      "|    6|   Monday|   24|\n",
      "|    9|   Monday|   22|\n",
      "|    4|   Friday|   26|\n",
      "|    1|   Friday|   23|\n",
      "|    7|   Monday|   22|\n",
      "|    4|  Tuesday|   26|\n",
      "|    6| Thursday|   24|\n",
      "|    1|Wednesday|   22|\n",
      "|    4|   Monday|   26|\n",
      "|    3|  Tuesday|   27|\n",
      "|    3|   Friday|   26|\n",
      "+-----+---------+-----+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# group data with respect to some columns \n",
    "sdf.groupBy([\"month\",\"weekday\"]).count().show() "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>month</th>\n",
       "      <th>weekday</th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3</td>\n",
       "      <td>Monday</td>\n",
       "      <td>26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>11</td>\n",
       "      <td>Saturday</td>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>10</td>\n",
       "      <td>Saturday</td>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>Wednesday</td>\n",
       "      <td>26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>Monday</td>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>79</th>\n",
       "      <td>5</td>\n",
       "      <td>Thursday</td>\n",
       "      <td>27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>80</th>\n",
       "      <td>2</td>\n",
       "      <td>Wednesday</td>\n",
       "      <td>25</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>81</th>\n",
       "      <td>1</td>\n",
       "      <td>Thursday</td>\n",
       "      <td>23</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>82</th>\n",
       "      <td>2</td>\n",
       "      <td>Monday</td>\n",
       "      <td>25</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>83</th>\n",
       "      <td>7</td>\n",
       "      <td>Wednesday</td>\n",
       "      <td>23</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>84 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   month    weekday  count\n",
       "0      3     Monday     26\n",
       "1     11   Saturday     22\n",
       "2     10   Saturday     22\n",
       "3      4  Wednesday     26\n",
       "4      1     Monday     22\n",
       "..   ...        ...    ...\n",
       "79     5   Thursday     27\n",
       "80     2  Wednesday     25\n",
       "81     1   Thursday     23\n",
       "82     2     Monday     25\n",
       "83     7  Wednesday     23\n",
       "\n",
       "[84 rows x 3 columns]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sdf.groupBy([\"month\",\"weekday\"]).count().toPandas()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------------+-----+\n",
      "|event_type_1|count|\n",
      "+------------+-----+\n",
      "|        NULL| 1807|\n",
      "|   Religious|   55|\n",
      "|    National|   52|\n",
      "|    Cultural|   37|\n",
      "|    Sporting|   18|\n",
      "+------------+-----+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "## Group and sort\n",
    "event1count=sdf.groupBy(\"event_type_1\").count()\n",
    "event1count.sort(\"count\", ascending=False).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "### Data cleaning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1969"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Returns a new DataFrame containing the distinct rows in this DataFrame.\n",
    "## Takes a while to compute\n",
    "sdf.distinct().count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+----------+--------+-------+----+-----+----+------+--------------+------------+--------------+------------+-------+-------+-------+\n",
      "|      date|wm_yr_wk|weekday|wday|month|year|     d|  event_name_1|event_type_1|  event_name_2|event_type_2|snap_CA|snap_TX|snap_WI|\n",
      "+----------+--------+-------+----+-----+----+------+--------------+------------+--------------+------------+-------+-------+-------+\n",
      "|2011-04-24|   11113| Sunday|   2|    4|2011|  d_86|OrthodoxEaster|   Religious|        Easter|    Cultural|      0|      0|      0|\n",
      "|2013-05-05|   11315| Sunday|   2|    5|2013| d_828|OrthodoxEaster|   Religious| Cinco De Mayo|    Cultural|      1|      1|      1|\n",
      "|2014-04-20|   11412| Sunday|   2|    4|2014|d_1178|        Easter|    Cultural|OrthodoxEaster|   Religious|      0|      0|      0|\n",
      "|2014-06-15|   11420| Sunday|   2|    6|2014|d_1234|  NBAFinalsEnd|    Sporting|  Father's day|    Cultural|      0|      1|      1|\n",
      "|2016-06-19|   11621| Sunday|   2|    6|2016|d_1969|  NBAFinalsEnd|    Sporting|  Father's day|    Cultural|      0|      0|      0|\n",
      "+----------+--------+-------+----+-----+----+------+--------------+------------+--------------+------------+-------+-------+-------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "## Returns a new DataFrame omitting rows with null values\n",
    "sdf_without_na = sdf.na.drop()\n",
    "sdf_without_na.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1969"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sdf.count() # original file size"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n",
      "|      date|wm_yr_wk|  weekday|wday|month|year|   d| event_name_1|event_type_1|event_name_2|event_type_2|snap_CA|snap_TX|snap_WI|\n",
      "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n",
      "|2011-01-29|   11101| Saturday|   1|    1|2011| d_1|      unknown|     unknown|     unknown|     unknown|      0|      0|      0|\n",
      "|2011-01-30|   11101|   Sunday|   2|    1|2011| d_2|      unknown|     unknown|     unknown|     unknown|      0|      0|      0|\n",
      "|2011-01-31|   11101|   Monday|   3|    1|2011| d_3|      unknown|     unknown|     unknown|     unknown|      0|      0|      0|\n",
      "|2011-02-01|   11101|  Tuesday|   4|    2|2011| d_4|      unknown|     unknown|     unknown|     unknown|      1|      1|      0|\n",
      "|2011-02-02|   11101|Wednesday|   5|    2|2011| d_5|      unknown|     unknown|     unknown|     unknown|      1|      0|      1|\n",
      "|2011-02-03|   11101| Thursday|   6|    2|2011| d_6|      unknown|     unknown|     unknown|     unknown|      1|      1|      1|\n",
      "|2011-02-04|   11101|   Friday|   7|    2|2011| d_7|      unknown|     unknown|     unknown|     unknown|      1|      0|      0|\n",
      "|2011-02-05|   11102| Saturday|   1|    2|2011| d_8|      unknown|     unknown|     unknown|     unknown|      1|      1|      1|\n",
      "|2011-02-06|   11102|   Sunday|   2|    2|2011| d_9|    SuperBowl|    Sporting|     unknown|     unknown|      1|      1|      1|\n",
      "|2011-02-07|   11102|   Monday|   3|    2|2011|d_10|      unknown|     unknown|     unknown|     unknown|      1|      1|      0|\n",
      "|2011-02-08|   11102|  Tuesday|   4|    2|2011|d_11|      unknown|     unknown|     unknown|     unknown|      1|      0|      1|\n",
      "|2011-02-09|   11102|Wednesday|   5|    2|2011|d_12|      unknown|     unknown|     unknown|     unknown|      1|      1|      1|\n",
      "|2011-02-10|   11102| Thursday|   6|    2|2011|d_13|      unknown|     unknown|     unknown|     unknown|      1|      0|      0|\n",
      "|2011-02-11|   11102|   Friday|   7|    2|2011|d_14|      unknown|     unknown|     unknown|     unknown|      0|      1|      1|\n",
      "|2011-02-12|   11103| Saturday|   1|    2|2011|d_15|      unknown|     unknown|     unknown|     unknown|      0|      1|      1|\n",
      "|2011-02-13|   11103|   Sunday|   2|    2|2011|d_16|      unknown|     unknown|     unknown|     unknown|      0|      1|      0|\n",
      "|2011-02-14|   11103|   Monday|   3|    2|2011|d_17|ValentinesDay|    Cultural|     unknown|     unknown|      0|      0|      1|\n",
      "|2011-02-15|   11103|  Tuesday|   4|    2|2011|d_18|      unknown|     unknown|     unknown|     unknown|      0|      1|      1|\n",
      "|2011-02-16|   11103|Wednesday|   5|    2|2011|d_19|      unknown|     unknown|     unknown|     unknown|      0|      0|      0|\n",
      "|2011-02-17|   11103| Thursday|   6|    2|2011|d_20|      unknown|     unknown|     unknown|     unknown|      0|      0|      0|\n",
      "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "## Replace null values\n",
    "sdf.na.fill(\"unknown\").show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## User-defined functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from pyspark.sql.functions import udf\n",
    "from pyspark.sql.types import StringType\n",
    "\n",
    "# Define a simple UDF to convert strings to uppercase\n",
    "def to_uppercase(s):\n",
    "    return s.upper() if s else None\n",
    "\n",
    "# Register UDF\n",
    "uppercase_udf = udf(to_uppercase, StringType())\n",
    "\n",
    "# Apply UDF to the 'weekday' column\n",
    "df_transformed = sdf.withColumn(\"weekday_upper\", uppercase_udf(sdf[\"weekday\"]))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---------+-------------+\n",
      "|  weekday|weekday_upper|\n",
      "+---------+-------------+\n",
      "| Saturday|     SATURDAY|\n",
      "|   Sunday|       SUNDAY|\n",
      "|   Monday|       MONDAY|\n",
      "|  Tuesday|      TUESDAY|\n",
      "|Wednesday|    WEDNESDAY|\n",
      "| Thursday|     THURSDAY|\n",
      "|   Friday|       FRIDAY|\n",
      "| Saturday|     SATURDAY|\n",
      "|   Sunday|       SUNDAY|\n",
      "|   Monday|       MONDAY|\n",
      "|  Tuesday|      TUESDAY|\n",
      "|Wednesday|    WEDNESDAY|\n",
      "| Thursday|     THURSDAY|\n",
      "|   Friday|       FRIDAY|\n",
      "| Saturday|     SATURDAY|\n",
      "|   Sunday|       SUNDAY|\n",
      "|   Monday|       MONDAY|\n",
      "|  Tuesday|      TUESDAY|\n",
      "|Wednesday|    WEDNESDAY|\n",
      "| Thursday|     THURSDAY|\n",
      "+---------+-------------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df_transformed.select(\"weekday\", \"weekday_upper\").show() # Show result"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python3.12 (PySpark3.5.4)",
   "language": "python",
   "name": "pyspark"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.8"
  },
  "rise": {
   "auto_select": "first",
   "autolaunch": false,
   "chalkboard": {
    "chalkEffect": 1,
    "chalkWidth": 4,
    "theme": "whiteboard",
    "transition": 800
   },
   "enable_chalkboard": true,
   "reveal_shortcuts": {
    "chalkboard": {
     "clear": "ctrl-k"
    }
   },
   "start_slideshow_at": "selected",
   "theme": "black"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}