{ "cells": [ { "cell_type": "markdown", "id": "9e8e550f-c673-4d14-a2e2-1e36692daaf1", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Time Series Features\n", "\n", "\n", "## Feng Li\n", "\n", "### Guanghua School of Management\n", "### Peking University\n", "\n", "\n", "### [feng.li@gsm.pku.edu.cn](feng.li@gsm.pku.edu.cn)\n", "### Course home page: [https://feng.li/bdcf](https://feng.li/bdcf)" ] }, { "cell_type": "markdown", "id": "80f075fb-9283-441c-bece-cbeb74d7d45d", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Why do we need time series features? --- The No-Free-Lunch theorem \n", "\n", "- There is never universally best method that fits in all situations.\n", "\n", "- The explosion of new algorithms development makes the question even more worth focusing.\n", "\n", "- No single forecasting method stands out the best for any type of time series.\n" ] }, { "cell_type": "markdown", "id": "ea481272-34af-4808-8bad-589f3eacc067", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "\n", "## Literature \n", "\n", "- Features of time series $\\rightarrow$ benefits in producing more accurate forecasting accuracies \n", "\n", "- Features $\\rightarrow$ forecasting method selection rules\n", "\n", "- \"Horses for courses\" $\\rightarrow$ effects of time series features to the forecasting performances\n", "\n", "- Visualize the performances of different forecasting methods $\\rightarrow$ better understanding of their relative performances" ] }, { "cell_type": "markdown", "id": "1f2605e7-ae23-483c-8bc5-de52f0c0114c", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Existing problems \n", "\n", "- inadequate features\n", "\n", "- limited training time series data (not only in number, but in diversity)" ] }, { "cell_type": "markdown", "id": "cd996637-ccca-446a-a479-9aa0e7495d87", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Questions to be answered\n", "\n", "- What time series features should be used?\n", "- How to construct time series features?\n", "- How to visualize time series features by projection?\n", "- How to model features and forecasting methods?\n", "- How to generate new time series with certain features?" ] }, { "cell_type": "markdown", "id": "fd100032-efcf-4238-ac52-48c7d9f383fe", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Time series features\n", "\n", "### Basic idea\n", "\n", "Transform a given time series $\\{x_1, x_2, \\cdots, x_n\\}$ to a feature vector $F = (F_1, F_2, \\cdots, F_p)$. \n", "\n", "#### A feature $F_k$ can be any kind of function computed from a time series:\n", "\n", "1. A simple mean\n", "2. The parameter of a fitted model\n", "3. Some statistic intended to highlight an attribute of the data\n", "4. ...\n" ] }, { "cell_type": "markdown", "id": "b413b141-e4d9-4fb2-97a2-feab698adfda", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Which features should we use?\n", "\n", "- There does not exist the best feature representation of a time series.\n", "- Depends on both the **nature** of the time series being analysed, and the **purpose** of the analysis. \n", "\n", " - With unit roots, the mean is not a meaningful feature without some constraints on the initial values. \\pause\n", " \n", " - CPU usage every minute for a large number of servers: we observe a daily seasonality. The mean may provide useful comparative information despite the time series not being stationary." ] }, { "cell_type": "markdown", "id": "b2041278-4a50-4950-b1be-0f111b86c1ed", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "- Time series are of different lengths, on different scales, and with different properties.\n", "- We restrict our features to be ergodic, stationary and independent of scale. \n", "- 17 sets of diverse features.\n", "- New features are intended to measure attributes associated with multiple seasonality, non-stationarity and heterogeneity of the time series." ] }, { "cell_type": "markdown", "id": "6c1f9ef8-7389-4e73-ba7c-1fcaad759b6c", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Features for multiple seasonal time series \n", "\n", "### STL decompostion extension\n", "$$ x_t = f_t + s_{1,t} + s_{2,t} + \\cdots + s_{M,t} + e_t.$$\n", "The strength of trend can be measured by:\n", "$$\n", " F_{10} = 1- \\frac{\\text{var}(e_t)}{\\text{var}(f_t + e_t)}.\n", "$$\n", "\n", "The strength of seasonality for the $i$th seasonal component:\n", "\n", "$$\n", "F_{11,i} = 1- \\frac{\\text{var}(e_t)}{\\text{var}(s_{i,t} + e_t)}.\n", "$$" ] }, { "cell_type": "markdown", "id": "bfe14497-62a7-4af2-ad10-a8e0f6e86460", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Features on heterogenity\n", "\n", "1. Pre-whiten the time series $x_t$ to remove the mean, trend, and Autoregressive (AR) information.\n", "3. Fit an GARCH(1,1) model on the pre-whitened time series $y_t$ to measure for the ARCH effects.\n", "4. Test for the arch effects in the obtained residuals $z_t$ using a second GARCH(1,1) model. \n", "\n", "### Features\n", "\n", "- The sum of squares of the first 12 autocorrelations of $\\{y_t^2\\}$.\n", "- The sum of squares of the first 12 autocorrelations of $\\{z_t^2\\}$.\n", "- The $R^2$ value of an AR model applied to $\\{y_t^2\\}$.\n", "- The $R^2$ value of an AR model applied to $\\{z_t^2\\}$." ] }, { "cell_type": "markdown", "id": "7cdace3a-1d4d-4b03-a864-a1622261d324", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Walmart unit sales data\n", "\n", "![Walmart](./figures/M5data.png)" ] }, { "cell_type": "markdown", "id": "499e93f9-ce9e-48b9-85fa-15a9c4d8189f", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Data Structure\n", "\n", "| Hierarchy Level | Description | Number of Series |\n", "|----------------|----------------------------------------------------------------------|------------------|\n", "| 1 | All products, all stores, all states | 1 |\n", "| 2 | All products by states | 3 |\n", "| 3 | All products by store | 10 |\n", "| 4 | All products by category | 3 |\n", "| 5 | All products by department | 7 |\n", "| 6 | Unit sales of all products, aggregated for each State and category | 9 |\n", "| 7 | Unit sales of all products, aggregated for each State and department | 21 |\n", "| 8 | Unit sales of all products, aggregated for each store and category | 30 |\n", "| 9 | Unit sales of all products, aggregated for each store and department | 70 |\n", "| 10 | Unit sales of product *x*, aggregated for all stores/states | 3,049 |\n", "| 11 | Unit sales of product *x*, aggregated for each State | 9,147 |\n", "| 12 | Unit sales of product *x*, aggregated for each store | 30,490 |\n", "| **Total** | | **42,840** |\n" ] }, { "cell_type": "markdown", "id": "388d9016-d0c6-4d1e-86f0-fda5097c8ef7", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Features for sales data \n", "\n", "| Feature | Description |\n", "|------------------------|--------------------------------------------------------------------------------------------------|\n", "| `sell_price` | Price of item in store for given date. |\n", "| `event_type` | 108 categorical events, e.g. sporting, cultural, religious. |\n", "| `event_name` | 157 event names for `event_type`, e.g. Super Bowl, Valentine's Day, President's Day. |\n", "| `event_name_2` | Name of event feature as given in competition data. |\n", "| `event_type_2` | Type of event feature as given in competition data. |\n", "| `snap_CA, TX, WI` | Binary indicator for SNAP information in CA, TX, WI. |\n", "| `release` | Release week of item in store. |\n", "\n", "- hierarchical structure of daily sales data of total $42,840$ series spanning 1,941 days" ] }, { "cell_type": "markdown", "id": "a4a603d4-95c6-42d5-b9b6-b88e7b4ede5a", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Features for sales data \n", "\n", "| Feature | Description |\n", "|------------------------|--------------------------------------------------------------------------------------------------|\n", "| `price_max, min` | Maximum, minimum price for item in store in the train data. |\n", "| `price_mean, std, norm` | Mean, standard deviation, and normalized price for item in store in the train data. |\n", "| `item, price_nunique` | Number of unique items, prices for item in store. |\n", "| `price_diff_w` | Weekly price changes for items in store. |\n", "| `price_diff_m` | Price changes of item in store compared to its monthly mean. |\n", "| `price_diff_y` | Price changes of item in store compared to its yearly mean. |\n", "| `tm_d` | Day of month. |\n", "| `tm_w` | Week in year. |\n", "| `tm_m` | Month in year. |\n", "| `tm_y` | Year index in the train data. |\n", "| `tm_wm` | Week in month. |\n", "| `tm_dw` | Day of week. |\n", "| `tm_w_end` | Weekend indicator. |\n" ] }, { "cell_type": "markdown", "id": "7dc5010b-9551-4285-bf07-d2da161adf31", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Visualisation features in 2D space\n", "\n", "#### t-Stochastic Neighbor Embedding (t-SNE)\n", "- Main idea: convert the distances to conditional probabilities and minimize the mismatch (kullback-Leibler divergence) between probabilities before and after the mapping.\n", "- Nonlinear and retaining both local and global structure\n", "\n", "#### PCA\n", "- Linear, and putting more emphasis on keeping dissimilar data points far apart\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 5 }