{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Forecasting with Gradient Boosted Trees\n", "\n", "## Feng Li\n", "\n", "### Guanghua School of Management\n", "### Peking University\n", "\n", "\n", "### [feng.li@gsm.pku.edu.cn](mailto:feng.li@gsm.pku.edu.cn)\n", "### Course home page: [https://feng.li/bdcf](https://feng.li/bdcf)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Gradient Boosted Tree Regressor\n", "\n", "`GBTRegressor` (**Gradient Boosted Tree Regressor**) in **Spark MLlib** is a **supervised learning algorithm** that builds an ensemble of decision trees with **gradient boosting** to improve predictive accuracy.\n", "\n", "\n", "### How GBTRegressor Works\n", "Gradient Boosted Trees (GBT) are built by **training decision trees sequentially**:\n", "1. The **first tree** makes an initial prediction.\n", "2. Each subsequent tree is fit to the **errors (residuals)** of the trees before it.\n", "3. The final prediction is the **learning-rate-weighted sum** of all trees’ outputs.\n", "\n", "This technique is effective for **capturing non-linear relationships** in data; boosting mainly reduces **bias**, while shrinkage and subsampling help control **variance**."
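, "\n", "The residual-fitting loop above can be sketched in plain Python (an illustrative toy with depth-1 \"stumps\" and squared-error loss, not Spark's implementation; the toy data and the `stump` helper are invented for this sketch):\n", "\n", "```python\n", "import numpy as np\n", "\n", "# Toy data: a noisy quadratic signal\n", "rng = np.random.default_rng(0)\n", "X = np.linspace(-2.0, 2.0, 200)\n", "y = X ** 2 + rng.normal(scale=0.1, size=X.size)\n", "\n", "def stump(X, r):\n", "    # Depth-1 'tree': best single split for squared error on residuals r\n", "    best = None\n", "    for s in X[:-1]:\n", "        left, right = r[X <= s], r[X > s]\n", "        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()\n", "        if best is None or sse < best[0]:\n", "            best = (sse, s, left.mean(), right.mean())\n", "    _, s, lv, rv = best\n", "    return lambda x: np.where(x <= s, lv, rv)\n", "\n", "pred = np.zeros_like(y)  # step 1: start from an initial prediction (here zero)\n", "step = 0.1               # learning rate, analogous to stepSize\n", "for _ in range(100):     # number of trees, analogous to maxIter\n", "    r = y - pred         # step 2: residuals of the current ensemble\n", "    pred += step * stump(X, r)(X)  # step 3: add the new tree's shrunken output\n", "\n", "print('train RMSE:', np.sqrt(np.mean((y - pred) ** 2)))\n", "```"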
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Code Example\n", "```python\n", "from pyspark.ml.regression import GBTRegressor\n", "from pyspark.ml.evaluation import RegressionEvaluator\n", "\n", "# Assumes train_df and test_df are DataFrames with a 'features' vector column\n", "# and a numeric 'sales' label column\n", "\n", "# Initialize GBT Regressor\n", "gbt = GBTRegressor(featuresCol=\"features\", labelCol=\"sales\", maxIter=50, maxDepth=5, stepSize=0.1)\n", "\n", "# Train the model\n", "model = gbt.fit(train_df)\n", "\n", "# Make predictions\n", "predictions = model.transform(test_df)\n", "\n", "# Evaluate using RMSE\n", "evaluator = RegressionEvaluator(labelCol=\"sales\", predictionCol=\"prediction\", metricName=\"rmse\")\n", "rmse = evaluator.evaluate(predictions)\n", "\n", "print(f\"Root Mean Squared Error (RMSE): {rmse}\")\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Key Hyperparameters\n", "| Parameter | Description |\n", "|-----------------|-------------|\n", "| `maxIter` | Number of trees in the ensemble (higher = more complex model) |\n", "| `maxDepth` | Maximum depth of each tree (higher = risk of overfitting) |\n", "| `stepSize` | Learning rate (default `0.1` for stability) |\n", "| `subsamplingRate` | Fraction of data used for each tree (default `1.0`, full dataset) |\n", "| `maxBins` | Number of bins for feature discretization (default `32`) |\n", "| `minInstancesPerNode` | Minimum instances required per node (default `1`) |\n", "\n", "### Recommended Settings\n", "- **For small datasets** → `maxIter=20, maxDepth=3`\n", "- **For large datasets** → `maxIter=50, maxDepth=5`\n", "- **For fine-tuning** → Adjust `stepSize` (`0.05–0.2`)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "## Advantages and Limitations\n", "\n", "✅ **Handles complex non-linear relationships** \n", "✅ **More accurate than a single Decision Tree** \n", "✅ **Built-in feature selection** (important features contribute more) 
\n", "✅ **No feature scaling or normalization required** (tree splits are scale-invariant) \n", "\n", "\n", "🚨 **Slower training compared to Random Forest** (sequential training of trees) \n", "🚨 **Prone to overfitting with large `maxDepth`** \n", "🚨 **Not suited for real-time applications** (expensive to update) \n", "\n", "---" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", "25/03/11 20:02:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n" ] }, { "data": { "text/html": [ "\n", "
SparkSession - in-memory
\n", " \n", "SparkContext
\n", "\n", " \n", "\n", "v3.5.3
local[*]
Optimized Spark