{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "9e8e550f-c673-4d14-a2e2-1e36692daaf1",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "# Time Series Features\n",
    "\n",
    "\n",
    "## Feng Li\n",
    "\n",
    "### Guanghua School of Management\n",
    "### Peking University\n",
    "\n",
    "\n",
    "### [feng.li@gsm.pku.edu.cn](mailto:feng.li@gsm.pku.edu.cn)\n",
    "### Course home page: [https://feng.li/bdcf](https://feng.li/bdcf)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80f075fb-9283-441c-bece-cbeb74d7d45d",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## Why do we need time series features? --- The No-Free-Lunch theorem \n",
    "\n",
    "- There is never a universally best method that fits all situations.\n",
    "\n",
    "- The rapid development of new algorithms makes the choice of method an even more pressing question.\n",
    "\n",
    "- No single forecasting method stands out as the best for every type of time series.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea481272-34af-4808-8bad-589f3eacc067",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "\n",
    "## Literature \n",
    "\n",
    "- Features of time series $\\rightarrow$ help produce more accurate forecasts\n",
    "\n",
    "- Features $\\rightarrow$ forecasting method selection rules\n",
    "\n",
    "- \"Horses for courses\" $\\rightarrow$ effects of time series features on forecasting performance\n",
    "\n",
    "- Visualizing the performance of different forecasting methods $\\rightarrow$ better understanding of their relative performance"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1f2605e7-ae23-483c-8bc5-de52f0c0114c",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## Existing problems \n",
    "\n",
    "- Inadequate features\n",
    "\n",
    "- Limited training time series data (not only in number, but also in diversity)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd996637-ccca-446a-a479-9aa0e7495d87",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## Questions to be answered\n",
    "\n",
    "- What time series features should be used?\n",
    "- How to construct time series features?\n",
    "- How to visualize time series features by projection?\n",
    "- How to model features and forecasting methods?\n",
    "- How to generate new time series with certain features?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fd100032-efcf-4238-ac52-48c7d9f383fe",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## Time series features\n",
    "\n",
    "### Basic idea\n",
    "\n",
    "Transform a given time series $\\{x_1, x_2, \\cdots, x_n\\}$ to a feature vector $F = (F_1, F_2, \\cdots, F_p)$. \n",
    "\n",
    "#### A feature $F_k$ can be any kind of function computed from a time series:\n",
    "\n",
    "1. A simple mean\n",
    "2. The parameter of a fitted model\n",
    "3. Some statistic intended to highlight an attribute of the data\n",
    "4. ...\n"
   ]
  },
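  {
   "cell_type": "markdown",
   "id": "sketch-basic-features-md",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "### A small sketch of the basic idea\n",
    "\n",
    "As a rough illustration, the code below maps a series to a three-element feature vector: the sample mean, a least-squares AR(1) coefficient (a parameter of a fitted model), and the lag-1 autocorrelation of the first differences (a statistic highlighting an attribute of the data). The function names `lag1_autocorr` and `feature_vector`, the choice of these three features, and the simulated random walk are assumptions made for illustration, not a prescribed feature set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-basic-features-code",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "-"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "def lag1_autocorr(v):\n",
    "    # Lag-1 sample autocorrelation of a one-dimensional array.\n",
    "    v = np.asarray(v, dtype=float) - np.mean(v)\n",
    "    return np.dot(v[1:], v[:-1]) / np.dot(v, v)\n",
    "\n",
    "\n",
    "def feature_vector(x):\n",
    "    # Map a series x_1, ..., x_n to F = (F_1, F_2, F_3).\n",
    "    x = np.asarray(x, dtype=float)\n",
    "    xc = x - x.mean()\n",
    "    f1 = x.mean()  # a simple mean\n",
    "    # AR(1) slope from regressing the demeaned x_t on x_{t-1} (no intercept)\n",
    "    f2 = np.dot(xc[1:], xc[:-1]) / np.dot(xc[:-1], xc[:-1])\n",
    "    f3 = lag1_autocorr(np.diff(x))  # a statistic of the differenced series\n",
    "    return np.array([f1, f2, f3])\n",
    "\n",
    "\n",
    "rng = np.random.default_rng(1)\n",
    "x = np.cumsum(rng.normal(size=200))  # simulated random walk, illustration only\n",
    "print(feature_vector(x))"
   ]
  },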
  {
   "cell_type": "markdown",
   "id": "b413b141-e4d9-4fb2-97a2-feab698adfda",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## Which features should we use?\n",
    "\n",
    "- There is no single best feature representation of a time series.\n",
    "- The choice depends on both the **nature** of the time series being analysed and the **purpose** of the analysis.\n",
    "\n",
    "    - With unit roots, the mean is not a meaningful feature without some constraints on the initial values.\n",
    "    \n",
    "    - CPU usage every minute for a large number of servers: we observe a daily seasonality. The mean may provide useful comparative information despite the time series not being stationary."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2041278-4a50-4950-b1be-0f111b86c1ed",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "- Time series are of different lengths, on different scales, and with different properties.\n",
    "- We restrict our features to be ergodic, stationary and independent of scale. \n",
    "- 17 sets of diverse features.\n",
    "- New features are intended to measure attributes associated with multiple seasonality, non-stationarity and heterogeneity of the time series."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c1f9ef8-7389-4e73-ba7c-1fcaad759b6c",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## Features for multiple seasonal time series \n",
    "\n",
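    "### STL decomposition extension\n",
    "$$ x_t = f_t + s_{1,t} + s_{2,t} + \\cdots + s_{M,t} + e_t,$$\n",
    "where $f_t$ is the trend, $s_{i,t}$ is the $i$th seasonal component, and $e_t$ is the remainder.\n",
    "\n",
    "The strength of trend can be measured by:\n",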
    "$$\n",
    "    F_{10} = 1- \\frac{\\text{var}(e_t)}{\\text{var}(f_t + e_t)}.\n",
    "$$\n",
    "\n",
    "The strength of seasonality for the $i$th seasonal component:\n",
    "\n",
    "$$\n",
    "F_{11,i} = 1- \\frac{\\text{var}(e_t)}{\\text{var}(s_{i,t} + e_t)}.\n",
    "$$"
   ]
  },
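  {
   "cell_type": "markdown",
   "id": "sketch-stl-strength-md",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "### A small sketch of the strength measures\n",
    "\n",
    "The two formulas above can be sketched for the special case of a single seasonal component ($M = 1$). This assumes `statsmodels` is installed and uses its `STL` class in place of the multiple-seasonal extension. The simulated series, its period of 12 and its noise level are made up for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-stl-strength-code",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "-"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from statsmodels.tsa.seasonal import STL\n",
    "\n",
    "# Simulated series: linear trend + period-12 seasonality + noise (illustration only)\n",
    "rng = np.random.default_rng(0)\n",
    "t = np.arange(240)\n",
    "x = 0.05 * t + 2.0 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.5, size=t.size)\n",
    "\n",
    "# Decompose x_t = f_t + s_t + e_t with a single seasonal component\n",
    "res = STL(x, period=12).fit()\n",
    "f, s, e = res.trend, res.seasonal, res.resid\n",
    "\n",
    "# Strength of trend: 1 - var(e_t) / var(f_t + e_t)\n",
    "trend_strength = 1 - np.var(e) / np.var(f + e)\n",
    "# Strength of seasonality: 1 - var(e_t) / var(s_t + e_t)\n",
    "seasonal_strength = 1 - np.var(e) / np.var(s + e)\n",
    "\n",
    "print(f'strength of trend:       {trend_strength:.3f}')\n",
    "print(f'strength of seasonality: {seasonal_strength:.3f}')"
   ]
  },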
  {
   "cell_type": "markdown",
   "id": "bfe14497-62a7-4af2-ad10-a8e0f6e86460",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## Features on heterogeneity\n",
    "\n",
    "1. Pre-whiten the time series $x_t$ to remove the mean, trend, and autoregressive (AR) information.\n",
    "2. Fit a GARCH(1,1) model to the pre-whitened time series $y_t$ to measure the ARCH effects.\n",
    "3. Test for ARCH effects in the obtained residuals $z_t$ using a second GARCH(1,1) model.\n",
    "\n",
    "### Features\n",
    "\n",
    "- The sum of squares of the first 12 autocorrelations of $\\{y_t^2\\}$.\n",
    "- The sum of squares of the first 12 autocorrelations of $\\{z_t^2\\}$.\n",
    "- The $R^2$ value of an AR model applied to $\\{y_t^2\\}$.\n",
    "- The $R^2$ value of an AR model applied to $\\{z_t^2\\}$."
   ]
  },
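  {
   "cell_type": "markdown",
   "id": "sketch-arch-features-md",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "### A small sketch of the heterogeneity features\n",
    "\n",
    "The two kinds of features listed above can be sketched with NumPy alone for a series that has already been pre-whitened. The helper names `acf` and `arch_features`, the AR order of 12, and the simulated white-noise input are assumptions; the pre-whitening and GARCH(1,1) fitting steps are not shown. The same function would be applied to $z_t$ in place of $y_t$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-arch-features-code",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "-"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "def acf(v, nlags):\n",
    "    # Sample autocorrelations at lags 1, ..., nlags.\n",
    "    v = np.asarray(v, dtype=float) - np.mean(v)\n",
    "    denom = np.dot(v, v)\n",
    "    return np.array([np.dot(v[k:], v[:-k]) / denom for k in range(1, nlags + 1)])\n",
    "\n",
    "\n",
    "def arch_features(y, nlags=12):\n",
    "    # Features computed from the squared series y_t^2.\n",
    "    y2 = np.asarray(y, dtype=float) ** 2\n",
    "    # Sum of squares of the first nlags autocorrelations of y_t^2\n",
    "    acf_sum_sq = np.sum(acf(y2, nlags) ** 2)\n",
    "    # R^2 of a least-squares AR(nlags) regression of y_t^2 on its own lags\n",
    "    n = len(y2)\n",
    "    lags = np.column_stack([y2[nlags - k:n - k] for k in range(1, nlags + 1)])\n",
    "    design = np.column_stack([np.ones(n - nlags), lags])\n",
    "    target = y2[nlags:]\n",
    "    beta, *_ = np.linalg.lstsq(design, target, rcond=None)\n",
    "    resid = target - design @ beta\n",
    "    r2 = 1 - np.var(resid) / np.var(target)\n",
    "    return acf_sum_sq, r2\n",
    "\n",
    "\n",
    "rng = np.random.default_rng(2)\n",
    "y = rng.normal(size=500)  # stand-in for a pre-whitened series, illustration only\n",
    "print(arch_features(y))"
   ]
  },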
  {
   "cell_type": "markdown",
   "id": "7cdace3a-1d4d-4b03-a864-a1622261d324",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## Walmart unit sales data\n",
    "\n",
    "![Walmart](./figures/M5data.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "499e93f9-ce9e-48b9-85fa-15a9c4d8189f",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## Data Structure\n",
    "\n",
    "| Hierarchy Level | Description                                                          | Number of Series |\n",
    "|----------------|----------------------------------------------------------------------|------------------|\n",
    "| 1              | All products, all stores, all states                                 | 1                |\n",
    "| 2              | All products by state                                               | 3                |\n",
    "| 3              | All products by store                                               | 10               |\n",
    "| 4              | All products by category                                           | 3                |\n",
    "| 5              | All products by department                                         | 7                |\n",
    "| 6              | Unit sales of all products, aggregated for each state and category  | 9                |\n",
    "| 7              | Unit sales of all products, aggregated for each state and department | 21               |\n",
    "| 8              | Unit sales of all products, aggregated for each store and category  | 30               |\n",
    "| 9              | Unit sales of all products, aggregated for each store and department | 70               |\n",
    "| 10             | Unit sales of product *x*, aggregated for all stores/states         | 3,049            |\n",
    "| 11             | Unit sales of product *x*, aggregated for each state                | 9,147            |\n",
    "| 12             | Unit sales of product *x*, aggregated for each store                | 30,490           |\n",
    "| **Total**      |                                                                      | **42,840**       |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "388d9016-d0c6-4d1e-86f0-fda5097c8ef7",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## Features for sales data \n",
    "\n",
    "| Feature                | Description                                                                                      |\n",
    "|------------------------|--------------------------------------------------------------------------------------------------|\n",
    "| `sell_price`          | Price of item in store for given date.                                                           |\n",
    "| `event_type`         | 108 categorical events, e.g. sporting, cultural, religious.                                      |\n",
    "| `event_name`         | 157 event names for `event_type`, e.g. Super Bowl, Valentine's Day, President's Day.             |\n",
    "| `event_name_2`       | Name of event feature as given in competition data.                                              |\n",
    "| `event_type_2`       | Type of event feature as given in competition data.                                              |\n",
    "| `snap_CA, TX, WI`   | Binary indicator for SNAP information in CA, TX, WI.                                             |\n",
    "| `release`            | Release week of item in store.                                                                   |\n",
    "\n",
    "- The daily sales data form a hierarchy of 42,840 series in total, spanning 1,941 days."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4a603d4-95c6-42d5-b9b6-b88e7b4ede5a",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## Features for sales data \n",
    "\n",
    "| Feature                | Description                                                                                      |\n",
    "|------------------------|--------------------------------------------------------------------------------------------------|\n",
    "| `price_max, min`     | Maximum, minimum price for item in store in the train data.                                     |\n",
    "| `price_mean, std, norm` | Mean, standard deviation, and normalized price for item in store in the train data.            |\n",
    "| `item, price_nunique` | Number of unique items, prices for item in store.                                               |\n",
    "| `price_diff_w`       | Weekly price changes for items in store.                                                         |\n",
    "| `price_diff_m`       | Price changes of item in store compared to its monthly mean.                                     |\n",
    "| `price_diff_y`       | Price changes of item in store compared to its yearly mean.                                      |\n",
    "| `tm_d`               | Day of month.                                                                                    |\n",
    "| `tm_w`               | Week in year.                                                                                    |\n",
    "| `tm_m`               | Month in year.                                                                                   |\n",
    "| `tm_y`               | Year index in the train data.                                                                    |\n",
    "| `tm_wm`              | Week in month.                                                                                   |\n",
    "| `tm_dw`              | Day of week.                                                                                      |\n",
    "| `tm_w_end`           | Weekend indicator.                                                                               |\n"
   ]
  },
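  {
   "cell_type": "markdown",
   "id": "sketch-calendar-features-md",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "### A small sketch of the calendar features\n",
    "\n",
    "The `tm_*` features in the table might be derived with pandas roughly as follows. The column names come from the table; the `date` column, the date range, and the exact definitions (for example, week in month as `(day - 1) // 7 + 1`, ISO week numbering, and Saturday/Sunday as the weekend) are assumptions rather than the competition code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-calendar-features-code",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "-"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical two-week calendar, illustration only\n",
    "df = pd.DataFrame({'date': pd.date_range('2016-01-01', periods=14, freq='D')})\n",
    "d = df['date'].dt\n",
    "\n",
    "df['tm_d'] = d.day                               # day of month\n",
    "df['tm_w'] = d.isocalendar().week.astype(int)    # ISO week in year\n",
    "df['tm_m'] = d.month                             # month in year\n",
    "df['tm_y'] = d.year - d.year.min()               # year index, starting at 0\n",
    "df['tm_wm'] = (d.day - 1) // 7 + 1               # week in month (assumed definition)\n",
    "df['tm_dw'] = d.dayofweek                        # day of week, Monday = 0\n",
    "df['tm_w_end'] = (d.dayofweek >= 5).astype(int)  # 1 on Saturday and Sunday\n",
    "\n",
    "print(df.head())"
   ]
  },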
  {
   "cell_type": "markdown",
   "id": "7dc5010b-9551-4285-bf07-d2da161adf31",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## Visualising features in 2D space\n",
    "\n",
    "#### t-Stochastic Neighbor Embedding (t-SNE)\n",
    "- Main idea: convert the distances to conditional probabilities and minimize the mismatch (Kullback-Leibler divergence) between the probabilities before and after the mapping.\n",
    "- Nonlinear, and retains both local and global structure\n",
    "\n",
    "#### PCA\n",
    "- Linear, and putting more emphasis on keeping dissimilar data points far apart\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}