{ "cells": [ { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Structured Data Processing with Spark\n", "\n", "\n", "## Feng Li\n", "\n", "### Guanghua School of Management\n", "### Peking University\n", "\n", "\n", "### [feng.li@gsm.pku.edu.cn](feng.li@gsm.pku.edu.cn)\n", "### Course home page: [https://feng.li/bdcf](https://feng.li/bdcf)" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Spark SQL\n", "\n", "- Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed.\n", "\n", "- Spark SQL uses this extra information to perform extra optimizations.\n", "\n", "- One use of Spark SQL is to execute SQL queries.\n", "\n", "- Spark SQL can also be used to read data from an existing Hive installation." ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "### Spark DataFrame\n", "\n", "- A DataFrame is a Dataset organized into named columns. \n", "\n", "- It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. \n", "\n", "- DataFrames can be constructed from a wide array of sources such as: \n", "\n", " - structured data files, \n", " - tables in Hive, \n", " - external databases, or \n", " - existing RDDs. \n", " \n", "- The DataFrame API is available in Scala, Java, Python, and R. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Start a Spark session\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", "25/02/23 20:39:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n" ] } ], "source": [ "import pyspark\n", "from pyspark.sql import SparkSession\n", "spark = SparkSession.builder\\\n", " .config(\"spark.executor.memory\", \"2g\")\\\n", " .config(\"spark.cores.max\", \"2\")\\\n", " .appName(\"Spark DataFrames\").getOrCreate() # using spark server" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "

SparkSession - in-memory

\n", " \n", "
\n", "

SparkContext

\n", "\n", "

Spark UI

\n", "\n", "
\n", "
Version
\n", "
v3.5.4
\n", "
Master
\n", "
local[*]
\n", "
AppName
\n", "
Spark DataFrames
\n", "
\n", "
\n", " \n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spark # test if Spark session is created or not" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Creating DataFrames" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Upload Data to HPC server \n", "\n", "- **Using SCOW web interface** (small dataset)\n", "\n", " https://scow-jx2.pku.edu.cn/files/jx2/~/\n", "\n", "- **Using FileZilla**\n", "\n", " - Make sure you have OTP enabled with your account\n", "\n", " https://hpc.pku.edu.cn/ug/guide/access/\n", "\n", " - FileZilla Menu -> File -> Site Manager\n", " \n", " **Host**: `wmjx2-login.pku.edu.cn`\n", " \n", " **Logon Type**: Interactive\n", " \n", " **User**: youruniversityid\n", " \n", "\n", "- **Using command line tool `rsync`**\n" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Create Spark DataFrame directly from a file" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n", "| date|wm_yr_wk| weekday|wday|month|year| d| event_name_1|event_type_1|event_name_2|event_type_2|snap_CA|snap_TX|snap_WI|\n", "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n", "|2011-01-29| 11101| Saturday| 1| 1|2011| d_1| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "|2011-01-30| 11101| Sunday| 2| 1|2011| d_2| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "|2011-01-31| 11101| Monday| 3| 1|2011| d_3| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "|2011-02-01| 11101| Tuesday| 4| 2|2011| d_4| NULL| NULL| NULL| NULL| 1| 1| 0|\n", "|2011-02-02| 11101|Wednesday| 5| 2|2011| d_5| NULL| NULL| NULL| NULL| 1| 0| 1|\n", "|2011-02-03| 11101| Thursday| 6| 2|2011| d_6| NULL| NULL| NULL| NULL| 1| 1| 1|\n", "|2011-02-04| 11101| Friday| 7| 2|2011| d_7| NULL| NULL| NULL| NULL| 1| 0| 0|\n", "|2011-02-05| 11102| Saturday| 1| 2|2011| d_8| NULL| NULL| NULL| NULL| 1| 1| 1|\n", "|2011-02-06| 11102| Sunday| 2| 2|2011| d_9| SuperBowl| Sporting| NULL| NULL| 1| 1| 1|\n", "|2011-02-07| 11102| Monday| 3| 2|2011|d_10| NULL| NULL| NULL| NULL| 1| 1| 0|\n", "|2011-02-08| 11102| Tuesday| 4| 2|2011|d_11| NULL| NULL| NULL| NULL| 1| 0| 1|\n", "|2011-02-09| 11102|Wednesday| 5| 2|2011|d_12| NULL| NULL| NULL| NULL| 1| 1| 1|\n", "|2011-02-10| 11102| Thursday| 6| 2|2011|d_13| NULL| NULL| NULL| NULL| 1| 0| 0|\n", "|2011-02-11| 11102| Friday| 7| 2|2011|d_14| NULL| NULL| NULL| NULL| 0| 1| 1|\n", "|2011-02-12| 11103| Saturday| 1| 2|2011|d_15| NULL| NULL| NULL| NULL| 0| 1| 1|\n", "|2011-02-13| 11103| Sunday| 2| 2|2011|d_16| NULL| NULL| NULL| NULL| 0| 1| 0|\n", "|2011-02-14| 11103| Monday| 3| 2|2011|d_17|ValentinesDay| Cultural| NULL| NULL| 0| 0| 1|\n", "|2011-02-15| 11103| Tuesday| 4| 2|2011|d_18| NULL| NULL| NULL| NULL| 0| 1| 1|\n", "|2011-02-16| 11103|Wednesday| 5| 2|2011|d_19| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "|2011-02-17| 11103| Thursday| 6| 2|2011|d_20| NULL| NULL| NULL| NULL| 0| 0| 0|\n", 
"+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "sdf = spark.read.csv(\"../data/m5-forecasting-accuracy/calendar.csv\", header=True) # read files\n", "# sdf = spark.read.csv(\"file:///nfs-share/home/2406184221/bdcf-slides/data/m5-forecasting-accuracy/calendar.csv\") # or use the full path\n", "sdf.show() # Displays the content of the DataFrame to stdout" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "### Export DataFrame to a local disk" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "sdf.write.mode('overwrite').csv(\"myspark/\")# Save Spark DataFrame to a folder on the local disk." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "['._SUCCESS.crc',\n", " '_SUCCESS',\n", " 'part-00000-a7fa5a2a-9dd5-4c48-b472-2f815b8d027d-c000.csv',\n", " '.part-00000-a7fa5a2a-9dd5-4c48-b472-2f815b8d027d-c000.csv.crc']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import os \n", "os.listdir(\"myspark\") # Let's check if everything is there on the local disk" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python3.12 (PySpark3.5.4)", "language": "python", "name": "pyspark" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.8" }, "rise": { "auto_select": "first", "autolaunch": false, "chalkboard": { "chalkEffect": 1, "chalkWidth": 4, "theme": "whiteboard", "transition": 800 }, "enable_chalkboard": true, "reveal_shortcuts": { "chalkboard": { "clear": "ctrl-k" } }, "start_slideshow_at": "selected", "theme": "black" } }, "nbformat": 4, "nbformat_minor": 4 }