{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Time Series Features\n", "\n", "\n", "## Feng Li\n", "\n", "### Guanghua School of Management\n", "### Peking University\n", "\n", "\n", "### [feng.li@gsm.pku.edu.cn](feng.li@gsm.pku.edu.cn)\n", "### Course home page: [https://feng.li/bdcf](https://feng.li/bdcf)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Why do we need time series features? --- The No-Free-Lunch theorem \n", "\n", "- There is never universally best method that fits in all situations.\n", "\n", "- The explosion of new algorithms development makes the question even more worth focusing.\n", "\n", "- No single forecasting method stands out the best for any type of time series.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "\n", "## Literature \n", "\n", "- Features of time series $\\rightarrow$ benefits in producing more accurate forecasting accuracies \n", "\n", "- Features $\\rightarrow$ forecasting method selection rules\n", "\n", "- \"Horses for courses\" $\\rightarrow$ effects of time series features to the forecasting performances\n", "\n", "- Visualize the performances of different forecasting methods $\\rightarrow$ better understanding of their relative performances" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Existing problems \n", "\n", "- inadequate features\n", "\n", "- limited training time series data (not only in number, but in diversity)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Questions to be answered\n", "\n", "- What time series features should be used?\n", "- How to construct time series features?\n", "- How to visualize time series features by projection?\n", "- How to model features and forecasting methods?\n", "- How to generate new time series with certain features?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Time series features\n", "\n", "### Basic idea\n", "\n", "Transform a given time series $\\{x_1, x_2, \\cdots, x_n\\}$ to a feature vector $F = (F_1, F_2, \\cdots, F_p)$. \n", "\n", "#### A feature $F_k$ can be any kind of function computed from a time series:\n", "\n", "1. A simple mean\n", "2. The parameter of a fitted model\n", "3. Some statistic intended to highlight an attribute of the data\n", "4. ...\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Which features should we use?\n", "\n", "- There does not exist the best feature representation of a time series.\n", "- Depends on both the **nature** of the time series being analysed, and the **purpose** of the analysis. \n", "\n", " - With unit roots, the mean is not a meaningful feature without some constraints on the initial values. \\pause\n", " \n", " - CPU usage every minute for a large number of servers: we observe a daily seasonality. The mean may provide useful comparative information despite the time series not being stationary." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "- Time series are of different lengths, on different scales, and with different properties.\n", "- We restrict our features to be ergodic, stationary and independent of scale. \n", "- 17 sets of diverse features.\n", "- New features are intended to measure attributes associated with multiple seasonality, non-stationarity and heterogeneity of the time series." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Features for multiple seasonal time series \n", "\n", "### STL decompostion extension\n", "$$ x_t = f_t + s_{1,t} + s_{2,t} + \\cdots + s_{M,t} + e_t.$$\n", "The strength of trend can be measured by:\n", "$$\n", " F_{10} = 1- \\frac{\\text{var}(e_t)}{\\text{var}(f_t + e_t)}.\n", "$$\n", "\n", "The strength of seasonality for the $i$th seasonal component:\n", "\n", "$$\n", "F_{11,i} = 1- \\frac{\\text{var}(e_t)}{\\text{var}(s_{i,t} + e_t)}.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Features on heterogenity\n", "\n", "1. Pre-whiten the time series $x_t$ to remove the mean, trend, and Autoregressive (AR) information.\n", "3. Fit an GARCH(1,1) model on the pre-whitened time series $y_t$ to measure for the ARCH effects.\n", "4. Test for the arch effects in the obtained residuals $z_t$ using a second GARCH(1,1) model. \n", "\n", "### Features\n", "\n", "- The sum of squares of the first 12 autocorrelations of $\\{y_t^2\\}$.\n", "- The sum of squares of the first 12 autocorrelations of $\\{z_t^2\\}$.\n", "- The $R^2$ value of an AR model applied to $\\{y_t^2\\}$.\n", "- The $R^2$ value of an AR model applied to $\\{z_t^2\\}$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Walmart unit sales data\n", "\n", "![Walmart](./figures/M5data.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Data Structure\n", "\n", "| Hierarchy Level | Description | Number of Series |\n", "|----------------|----------------------------------------------------------------------|------------------|\n", "| 1 | All products, all stores, all states | 1 |\n", "| 2 | All products by states | 3 |\n", "| 3 | All products by store | 10 |\n", "| 4 | All products by category | 3 |\n", "| 5 | All products by department | 7 |\n", "| 6 | Unit sales of all products, aggregated for each State and category | 9 |\n", "| 7 | Unit sales of all products, aggregated for each State and department | 21 |\n", "| 8 | Unit sales of all products, aggregated for each store and category | 30 |\n", "| 9 | Unit sales of all products, aggregated for each store and department | 70 |\n", "| 10 | Unit sales of product *x*, aggregated for all stores/states | 3,049 |\n", "| 11 | Unit sales of product *x*, aggregated for each State | 9,147 |\n", "| 12 | Unit sales of product *x*, aggregated for each store | 30,490 |\n", "| **Total** | | **42,840** |\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Features for sales data \n", "\n", "| Feature | Description |\n", "|------------------------|--------------------------------------------------------------------------------------------------|\n", "| `sell_price` | Price of item in store for given date. |\n", "| `event_type` | 108 categorical events, e.g. sporting, cultural, religious. |\n", "| `event_name` | 157 event names for `event_type`, e.g. Super Bowl, Valentine's Day, President's Day. |\n", "| `event_name_2` | Name of event feature as given in competition data. |\n", "| `event_type_2` | Type of event feature as given in competition data. |\n", "| `snap_CA, TX, WI` | Binary indicator for SNAP information in CA, TX, WI. |\n", "| `release` | Release week of item in store. |\n", "\n", "- hierarchical structure of daily sales data of total $42,840$ series spanning 1,941 days" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Features for sales data \n", "\n", "| Feature | Description |\n", "|------------------------|--------------------------------------------------------------------------------------------------|\n", "| `price_max, min` | Maximum, minimum price for item in store in the train data. |\n", "| `price_mean, std, norm` | Mean, standard deviation, and normalized price for item in store in the train data. |\n", "| `item, price_nunique` | Number of unique items, prices for item in store. |\n", "| `price_diff_w` | Weekly price changes for items in store. |\n", "| `price_diff_m` | Price changes of item in store compared to its monthly mean. |\n", "| `price_diff_y` | Price changes of item in store compared to its yearly mean. |\n", "| `tm_d` | Day of month. |\n", "| `tm_w` | Week in year. |\n", "| `tm_m` | Month in year. |\n", "| `tm_y` | Year index in the train data. |\n", "| `tm_wm` | Week in month. |\n", "| `tm_dw` | Day of week. |\n", "| `tm_w_end` | Weekend indicator. |\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Visualisation features in 2D space\n", "\n", "#### t-Stochastic Neighbor Embedding (t-SNE)\n", "- Main idea: convert the distances to conditional probabilities and minimize the mismatch (kullback-Leibler divergence) between probabilities before and after the mapping.\n", "- Nonlinear and retaining both local and global structure\n", "\n", "#### PCA\n", "- Linear, and putting more emphasis on keeping dissimilar data points far apart\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Further reading\n", "\n", "- Hyndman, R. J., Wang, E., & Laptev, N. (2015). Large-scale unusual time series detection. Proceedings of the IEEE International Conference on Data Mining, 1616–1619. https://doi.org/10.1109/ICDMW.2015.104\n", "- Kang, Y., Hyndman, R. J., & Smith-Miles, K. (2017). Visualising forecasting algorithm performance using time series instance spaces. International Journal of Forecasting, 33(2), 345–358. https://doi.org/10.1016/j.ijforecast.2016.09.004" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }