{ "cells": [ { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Structured Data Processing with Spark\n", "\n", "\n", "## Feng Li\n", "\n", "### Guanghua School of Management\n", "### Peking University\n", "\n", "\n", "### [feng.li@gsm.pku.edu.cn](feng.li@gsm.pku.edu.cn)\n", "### Course home page: [https://feng.li/bdcf](https://feng.li/bdcf)" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Spark SQL\n", "\n", "- Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed.\n", "\n", "- Spark SQL uses this extra information to perform extra optimizations.\n", "\n", "- One use of Spark SQL is to execute SQL queries.\n", "\n", "- Spark SQL can also be used to read data from an existing Hive installation." ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "### Spark DataFrame\n", "\n", "- A DataFrame is a Dataset organized into named columns. \n", "\n", "- It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. \n", "\n", "- DataFrames can be constructed from a wide array of sources such as: \n", "\n", " - structured data files, \n", " - tables in Hive, \n", " - external databases, or \n", " - existing RDDs. \n", " \n", "- The DataFrame API is available in Scala, Java, Python, and R. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Start a Spark session\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", "25/02/23 20:39:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n" ] } ], "source": [ "import pyspark\n", "from pyspark.sql import SparkSession\n", "spark = SparkSession.builder\\\n", " .config(\"spark.executor.memory\", \"2g\")\\\n", " .config(\"spark.cores.max\", \"2\")\\\n", " .appName(\"Spark DataFrames\").getOrCreate() # using spark server" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "

SparkSession - in-memory

\n", " \n", "
\n", "

SparkContext

\n", "\n", "

Spark UI

\n", "\n", "
\n", "
Version
\n", "
v3.5.4
\n", "
Master
\n", "
local[*]
\n", "
AppName
\n", "
Spark DataFrames
\n", "
\n", "
\n", " \n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spark # test if Spark session is created or not" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Creating DataFrames" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Upload Data to HPC server \n", "\n", "- **Using SCOW web interface** (small dataset)\n", "\n", " https://scow-jx2.pku.edu.cn/files/jx2/~/\n", "\n", "- **Using FileZilla**\n", "\n", " - Make sure you have OTP enabled with your account\n", "\n", " https://hpc.pku.edu.cn/ug/guide/access/\n", "\n", " - FileZilla Menu -> File -> Site Manager\n", " \n", " **Host**: `wmjx2-login.pku.edu.cn`\n", " \n", " **Logon Type**: Interactive\n", " \n", " **User**: youruniversityid\n", " \n", "\n", "- **Using command line tool `rsync`**\n" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Create Spark DataFrame directly from a file" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n", "| date|wm_yr_wk| weekday|wday|month|year| d| event_name_1|event_type_1|event_name_2|event_type_2|snap_CA|snap_TX|snap_WI|\n", "+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n", "|2011-01-29| 11101| Saturday| 1| 1|2011| d_1| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "|2011-01-30| 11101| Sunday| 2| 1|2011| d_2| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "|2011-01-31| 11101| Monday| 3| 1|2011| d_3| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "|2011-02-01| 11101| Tuesday| 4| 2|2011| d_4| NULL| NULL| NULL| NULL| 1| 1| 0|\n", "|2011-02-02| 11101|Wednesday| 5| 2|2011| d_5| NULL| NULL| NULL| NULL| 1| 0| 1|\n", "|2011-02-03| 11101| Thursday| 6| 2|2011| d_6| NULL| NULL| NULL| NULL| 1| 1| 1|\n", "|2011-02-04| 11101| Friday| 7| 2|2011| d_7| NULL| NULL| NULL| NULL| 1| 0| 0|\n", "|2011-02-05| 11102| Saturday| 1| 2|2011| d_8| NULL| NULL| NULL| NULL| 1| 1| 1|\n", "|2011-02-06| 11102| Sunday| 2| 2|2011| d_9| SuperBowl| Sporting| NULL| NULL| 1| 1| 1|\n", "|2011-02-07| 11102| Monday| 3| 2|2011|d_10| NULL| NULL| NULL| NULL| 1| 1| 0|\n", "|2011-02-08| 11102| Tuesday| 4| 2|2011|d_11| NULL| NULL| NULL| NULL| 1| 0| 1|\n", "|2011-02-09| 11102|Wednesday| 5| 2|2011|d_12| NULL| NULL| NULL| NULL| 1| 1| 1|\n", "|2011-02-10| 11102| Thursday| 6| 2|2011|d_13| NULL| NULL| NULL| NULL| 1| 0| 0|\n", "|2011-02-11| 11102| Friday| 7| 2|2011|d_14| NULL| NULL| NULL| NULL| 0| 1| 1|\n", "|2011-02-12| 11103| Saturday| 1| 2|2011|d_15| NULL| NULL| NULL| NULL| 0| 1| 1|\n", "|2011-02-13| 11103| Sunday| 2| 2|2011|d_16| NULL| NULL| NULL| NULL| 0| 1| 0|\n", "|2011-02-14| 11103| Monday| 3| 2|2011|d_17|ValentinesDay| Cultural| NULL| NULL| 0| 0| 1|\n", "|2011-02-15| 11103| Tuesday| 4| 2|2011|d_18| NULL| NULL| NULL| NULL| 0| 1| 1|\n", "|2011-02-16| 11103|Wednesday| 5| 2|2011|d_19| NULL| NULL| NULL| NULL| 0| 0| 0|\n", "|2011-02-17| 11103| Thursday| 6| 2|2011|d_20| NULL| NULL| NULL| NULL| 0| 0| 0|\n", 
"+----------+--------+---------+----+-----+----+----+-------------+------------+------------+------------+-------+-------+-------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "sdf = spark.read.csv(\"../data/m5-forecasting-accuracy/calendar.csv\", header=True) # read files\n", "# sdf = spark.read.csv(\"file:///nfs-share/home/2406184221/bdcf-slides/data/m5-forecasting-accuracy/calendar.csv\") # or use the full path\n", "sdf.show() # Displays the content of the DataFrame to stdout" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "### Export DataFrame to a local disk" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "sdf.write.mode('overwrite').csv(\"myspark/\")# Save Spark DataFrame to a folder on the local disk." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "editable": true, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "['._SUCCESS.crc',\n", " '_SUCCESS',\n", " 'part-00000-a7fa5a2a-9dd5-4c48-b472-2f815b8d027d-c000.csv',\n", " '.part-00000-a7fa5a2a-9dd5-4c48-b472-2f815b8d027d-c000.csv.crc']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import os \n", "os.listdir(\"myspark\") # Let's check if everything is there on the local disk" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python3.12 (PySpark3.5.4)", "language": "python", "name": "pyspark" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.8" }, "rise": { "auto_select": "first", "autolaunch": false, "chalkboard": { "chalkEffect": 1, "chalkWidth": 4, "theme": "whiteboard", "transition": 800 }, "enable_chalkboard": true, "reveal_shortcuts": { "chalkboard": { "clear": "ctrl-k" } }, "start_slideshow_at": "selected", "theme": "black" } }, "nbformat": 4, "nbformat_minor": 4 }