{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Machine Learning with Spark\n", "\n", "\n", "## Feng Li\n", "\n", "### Guanghua School of Management\n", "### Peking University\n", "\n", "\n", "### [feng.li@gsm.pku.edu.cn](feng.li@gsm.pku.edu.cn)\n", "### Course home page: [https://feng.li/bdcf](https://feng.li/bdcf)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Machine Learning Library\n", "\n", "MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as:\n", "\n", "- **ML Algorithms**: common learning algorithms such as classification, regression, clustering, and collaborative filtering\n", "- **Featurization**: feature extraction, transformation, dimensionality reduction, and selection\n", "- **Pipelines**: tools for constructing, evaluating, and tuning ML Pipelines\n", "- **Persistence**: saving and load algorithms, models, and Pipelines\n", "- **Utilities**: linear algebra, statistics, data handling, etc." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## MLlib APIs\n", "\n", "\n", "- The old RDD-based APIs in the `spark.mllib` package have entered maintenance mode. \n", "\n", "- The primary Machine Learning API for Spark is now the DataFrame-based API in the `spark.ml` package.\n", "\n", "- Why the DataFrame-based API?\n", "\n", " - DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.\n", " \n", " - The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.\n", " \n", " - DataFrames facilitate practical ML Pipelines, particularly feature transformations. 
See the Pipelines guide for details." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Start a Spark Session " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "
SparkSession - in-memory\n", "\n", "SparkContext\n", "\n", "Version: v3.5.3\n", "Master: local[*]\n", "AppName: Spark DataFrame\n" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 4 }