{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Machine Learning with Spark\n", "\n", "## Feng Li\n", "\n", "### Central University of Finance and Economics\n", "\n", "### [feng.li@cufe.edu.cn](mailto:feng.li@cufe.edu.cn)\n", "### Course home page: [https://feng.li/distcomp](https://feng.li/distcomp)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Machine Learning Library\n", "\n", "MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as:\n", "\n", "- ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering\n", "- Featurization: feature extraction, transformation, dimensionality reduction, and selection\n", "- Pipelines: tools for constructing, evaluating, and tuning ML Pipelines\n", "- Persistence: saving and loading algorithms, models, and Pipelines\n", "- Utilities: linear algebra, statistics, data handling, etc." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## MLlib APIs\n", "\n", "\n", "- The RDD-based APIs in the `spark.mllib` package have entered maintenance mode. \n", "\n", "- The primary Machine Learning API for Spark is now the DataFrame-based API in the `spark.ml` package.\n", "\n", "- Why the DataFrame-based API?\n", "\n", "    - DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.\n", "    \n", "    - The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.\n", "    \n", "    - DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the Pipelines guide for details."
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Start a Spark Session " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", "
SparkSession - in-memory\n", "SparkContext\n", "Version: v2.4.4\n", "Master: local[*]\n", "AppName: Python Spark with ML