{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Structured Data Processing with Spark\n", "\n", "\n", "## Feng Li\n", "\n", "### Central University of Finance and Economics\n", "\n", "### [feng.li@cufe.edu.cn](feng.li@cufe.edu.cn)\n", "### Course home page: [https://feng.li/distcomp](https://feng.li/distcomp)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Spark SQL\n", "\n", "- Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed.\n", "\n", "- Spark SQL uses this extra information to perform extra optimizations.\n", "\n", "- One use of Spark SQL is to execute SQL queries.\n", "\n", "- Spark SQL can also be used to read data from an existing Hive installation." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Datasets and DataFrames\n", "\n", "\n", "### Spark Datasets\n", "\n", "- A Dataset is a distributed collection of data.\n", "\n", "- Dataset can be constructed from JVM objects and then manipulated using functional transformations (`map, flatMap, filter`, etc.). \n", "\n", "- The Dataset API is available in **Scala** and **Java**. \n", "\n", "- Python **does not have the support for the Dataset API**. But due to Python’s dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally row.columnName)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Spark DataFrame\n", "\n", "- A DataFrame is a Dataset organized into named columns. \n", "\n", "- It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. \n", "\n", "- DataFrames can be constructed from a wide array of sources such as: \n", "\n", " - structured data files, \n", " - tables in Hive, \n", " - external databases, or \n", " - existing RDDs. \n", " \n", "- The DataFrame API is available in Scala, Java, Python, and R. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Start a Spark session\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import findspark ## Only needed when you run spark witin Jupyter notebook\n", "findspark.init()\n", "import pyspark\n", "from pyspark.sql import SparkSession\n", "spark = SparkSession.builder\\\n", " .config(\"spark.executor.memory\", \"2g\")\\\n", " .config(\"spark.cores.max\", \"2\")\\\n", " .master(\"spark://master:7077\")\\\n", " .appName(\"Python Spark\").getOrCreate() # using spark server" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "\n", "
The session summary as displayed in the notebook:

    SparkSession - in-memory
    SparkContext
    Version: v2.4.4
    Master: local[*]
    AppName: Python Spark
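The table below is the surviving output of a DataFrame query whose code cell was lost. A minimal sketch that would produce such a result, assuming the `people.json` file distributed with the Spark examples (in which Justin, age 19, is the only person under 21):

```python
# Read a JSON file into a DataFrame (the file path is an assumption
# based on the example data set shipped with Spark).
df = spark.read.json("examples/src/main/resources/people.json")

# Keep people younger than 21, select the "name" column, and convert the
# result to a pandas DataFrame, which the notebook renders as a table.
df.filter(df["age"] < 21).select("name").toPandas()
```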
\n", " | name | \n", "
---|---|
0 | \n", "Justin | \n", "
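As noted in the Spark SQL section, DataFrames can also be queried with plain SQL. A minimal sketch, reusing the `df` from the assumed example above (the view name `people` is an arbitrary choice):

```python
# Register the DataFrame as a temporary view so it can be referenced
# from SQL; spark.sql(...) returns a new DataFrame.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age IS NOT NULL").show()
```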