{ "cells": [ { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Structured Data Processing with Spark\n", "\n", "\n", "## Feng Li\n", "\n", "### Guanghua School of Management\n", "### Peking University\n", "\n", "\n", "### [feng.li@gsm.pku.edu.cn](feng.li@gsm.pku.edu.cn)\n", "### Course home page: [https://feng.li/bdcf](https://feng.li/bdcf)" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Spark SQL\n", "\n", "- Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed.\n", "\n", "- Spark SQL uses this extra information to perform extra optimizations.\n", "\n", "- One use of Spark SQL is to execute SQL queries.\n", "\n", "- Spark SQL can also be used to read data from an existing Hive installation." ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "### Spark DataFrame\n", "\n", "- A DataFrame is a Dataset organized into named columns. \n", "\n", "- It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. \n", "\n", "- DataFrames can be constructed from a wide array of sources such as: \n", "\n", " - structured data files, \n", " - tables in Hive, \n", " - external databases, or \n", " - existing RDDs. \n", " \n", "- The DataFrame API is available in Scala, Java, Python, and R. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Start a Spark session\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", "25/02/23 20:39:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n" ] } ], "source": [ "import pyspark\n", "from pyspark.sql import SparkSession\n", "spark = SparkSession.builder\\\n", " .config(\"spark.executor.memory\", \"2g\")\\\n", " .config(\"spark.cores.max\", \"2\")\\\n", " .appName(\"Spark DataFrames\").getOrCreate() # using spark server" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "
SparkSession - in-memory\n", "SparkContext\n", "Version: v3.5.4\n", "Master: local[*]\n", "AppName: Spark DataFrames
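{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### A minimal DataFrame and SQL sketch\n", "\n", "A minimal sketch, assuming the `spark` session created above: it builds a DataFrame from an in-memory collection, registers it as a temporary view, and queries it with `spark.sql`. The view name `people`, the column names, and the sample rows are illustrative only.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Build a small DataFrame from an in-memory collection (illustrative data)\n", "df = spark.createDataFrame([(\"Alice\", 34), (\"Bob\", 45)], [\"name\", \"age\"])\n", "\n", "# Register the DataFrame as a temporary view so it can be queried with SQL\n", "df.createOrReplaceTempView(\"people\")\n", "\n", "# Execute a SQL query; the result is again a DataFrame\n", "spark.sql(\"SELECT name FROM people WHERE age > 40\").show()\n" ] },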