Time Series Forecasting with ARIMA¶

Feng Li

School of Statistics and Mathematics

Central University of Finance and Economics

feng.li@cufe.edu.cn

https://feng.li/python

ARIMA models¶

  • Have been a popular forecasting tool since the 1970s
  • Why do they work?
    • ARMA is a flexible and parsimonious way to represent any stationary time series.
    • Non-stationary data can be differenced to make it stationary (hence ARIMA).
    • Maximum likelihood can be used to estimate stationary ARMA models.
    • Forecasts from such models are easy to generate.
  • First question: What is stationarity?

Expectations in time series¶

  • Consider the time series
$$Y_1,Y_2,\dots,Y_T$$
  • What do we mean by the expected value $E(Y_t)$?

  • Let the density of $Y_t$ be $f_{Y_t}$. The expected value is given by

$$E(Y_t)=\int y_tf_{Y_t}(y_t)dy_t$$
  • This is the unconditional expected value.

An Illustration¶

  • Imagine $I$ multiverses, each with a different realisation of the time series
$$\begin{aligned}\textrm{Multiverse 1:}\,&y^{(1)}_1,y^{(1)}_2\dots, y^{(1)}_t,\dots,y^{(1)}_{T}\\\textrm{Multiverse 2:}\,& y^{(2)}_1,y^{(2)}_2\dots, y^{(2)}_t,\dots,y^{(2)}_{T}\\&\vdots\quad\vdots\quad\vdots\quad\vdots\quad\vdots\quad\vdots\\\textrm{Multiverse I:}\,& y^{(I)}_1,y^{(I)}_2\dots, y^{(I)}_t,\dots,y^{(I)}_{T}\end{aligned}$$
  • Let the number of multiverses go to infinity and average observation $t$ across the multiverses (a small simulation sketch follows the formula).
$$E(Y_t)=\underset{I\rightarrow\infty}{\lim}\frac{1}{I}\sum\limits_{i=1}^{I}y^{(i)}_t$$
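
A minimal sketch of this thought experiment in code, assuming the simplest possible stationary series (i.i.d. Gaussian noise, so $E(Y_t)=0$); all names and sizes here are purely illustrative:

import numpy as np

I, T = 100000, 50                         # many "multiverses", each a short series
y = np.random.standard_normal((I, T))     # row i is realisation y^(i)

# The ensemble average at a fixed time t approximates E(Y_t), which is 0 here.
t = 25
print(y[:, t].mean())                     # close to 0 for large I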

Other concepts¶

For some time series, the time average of a single realisation $i$ converges to the same limit:

$$E(Y_t)=\underset{T\rightarrow\infty}{\lim}\frac{1}{T}\sum\limits_{t=1}^{T}y^{(i)}_t$$

These time series are called ergodic for the mean.

Also, do not confuse conditional and unconditional expectations:

$$E(Y_t)\,\textrm{ is not equivalent to }\,E(Y_t|Y_{t-1},Y_{t-2},\dots)$$
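
As a small worked contrast, assuming (for illustration only) the AR(1) model $Y_t=\phi Y_{t-1}+\epsilon_t$ introduced later in these notes, with $|\phi|<1$ and zero-mean errors:

$$E(Y_t\,|\,Y_{t-1}=y_{t-1})=\phi\, y_{t-1}\quad\textrm{(changes with the observed past)},\qquad E(Y_t)=0\quad\textrm{(a single fixed number)}$$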

Variance and (Auto-)Covariance in Time Series¶

  • The variance of $Y_t$ is given by
$$\textrm{Var}(Y_t)=E\left[(y_t-\mu_t)^2\right]$$
  • Here $\mu_t=E(Y_t)$ is just shorthand
  • The covariance between $Y_t$ and $Y_{t-k}$ is given by
$$\textrm{Cov}(Y_t,Y_{t-k})=E\left[(y_t-\mu_t)(y_{t-k}-\mu_{t-k})\right]$$
  • Together with the mean, the variances and covariances make up the first two moments of the time series.
  • Both can be defined using an integral, but also via the "multiverses" thought experiment.

Variance Covariance Matrix¶

It is often convenient to put all variances and covariances into a matrix

$$\textrm{Var-Cov}(\mathbf{Y})=\begin{bmatrix}\textrm{Var}(Y_1) & \textrm{Cov}(Y_1,Y_2) &\cdots& \textrm{Cov}(Y_1,Y_T) \\ \textrm{Cov}(Y_1,Y_2)&\textrm{Var}(Y_2) &\cdots& \textrm{Cov}(Y_2,Y_T)\\\vdots&\vdots&\ddots&\vdots \\\textrm{Cov}(Y_1,Y_T)&\textrm{Cov}(Y_2,Y_T) &\cdots& \textrm{Var}(Y_T)\end{bmatrix}$$

This is called the variance-covariance matrix.
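
A minimal sketch, assuming we can simulate many realisations and stack them as rows (as in the multiverse picture): np.cov then estimates the $T\times T$ variance-covariance matrix from the columns.

import numpy as np

I, T = 5000, 4
y = np.cumsum(np.random.standard_normal((I, T)), axis=1)   # I random walks of length T

# rowvar=False treats each column (each time point) as a variable,
# giving the estimated T x T matrix of Var(Y_t) and Cov(Y_s, Y_t).
print(np.cov(y, rowvar=False).round(2))

For this random walk the diagonal grows roughly like $t$, an early hint of non-stationarity.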

Autocorrelation function¶

  • The autocorrelation at lag $k$ is the covariance $\textrm{Cov}(Y_t,Y_{t-k})$ divided by the product of the standard deviations of $Y_t$ and $Y_{t-k}$.
  • The autocorrelation as a function of $k$ is known as the autocorrelation function (ACF).
  • This gives an indication of the strength of the relationship between $Y_t$ and $Y_{t-k}$.
  • The autocorrelation function can be estimated from a sample of data, as sketched below.
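
A minimal sketch of estimating the ACF from a sample with statsmodels (statsmodels.tsa.stattools.acf; the AR(1)-style series is only for illustration):

import numpy as np
from statsmodels.tsa.stattools import acf

T = 500
y = np.zeros(T)
epsilon = np.random.normal(0, 1, T)
for t in range(1, T):
    y[t] = 0.5*y[t-1] + epsilon[t]   # illustrative AR(1)-type series

print(acf(y, nlags=10))              # sample autocorrelations at lags 0..10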

Partial autocorrelation function¶

  • A similar concept is the partial autocorrelation.
  • At lag $k$ it is based on a conditional expectation
$$E\left[(Y_{t}-\mu_t)(Y_{t-k}-\mu_{t-k})\,|\,Y_{t-1},Y_{t-2},\dots,Y_{t-k+1}\right]$$
  • This is the "pure" effect of $Y_{t-k}$ on $Y_t$, controlling for all intervening observations.
  • It can be understood by analogy to a regression model, as sketched below.
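
A sketch of the regression analogy (an assumption of this illustration: the lag-$k$ partial autocorrelation is approximately the coefficient on $Y_{t-k}$ when $Y_t$ is regressed on its first $k$ lags), compared with statsmodels' pacf:

import numpy as np
from statsmodels.tsa.stattools import pacf

T = 500
y = np.zeros(T)
epsilon = np.random.normal(0, 1, T)
for t in range(1, T):
    y[t] = 0.5*y[t-1] + epsilon[t]   # illustrative AR(1)-type series

# Regress Y_t on Y_{t-1}, ..., Y_{t-k}; the coefficient on Y_{t-k}
# is (approximately) the partial autocorrelation at lag k.
k = 2
X = np.column_stack([np.ones(T - k)] + [y[k-j-1:T-j-1] for j in range(k)])
beta = np.linalg.lstsq(X, y[k:], rcond=None)[0]
print(beta[-1])              # coefficient on the k-th lag
print(pacf(y, nlags=k)[k])   # statsmodels' estimate, for comparison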

Weak Stationarity¶

If the following properties hold:

  • $E(Y_t)=\mu$: Expected value not a function of time
  • $\textrm{Var}(Y_t)=\sigma^2$: Variance not a function of time
  • $\textrm{Cov}(Y_t,Y_{t-k})=\gamma_k$: Covariance not a function of time (it can be a function of $k$)

Then a series is weakly (or covariance) stationary

Strong stationarity¶

If the following property holds:

$$f(Y_{t},Y_{t+{j_1}},Y_{t+{j_2}},\dots,Y_{t+{j_n}}) \textrm{ only depends on } (j_1,j_2,\dots,j_n) \textrm{ and not on time}$$

Then a series is strongly stationary

  • Strong stationarity implies weak stationarity whenever the first two moments exist.
  • The reverse holds for Gaussian processes but not in general.

Examples¶

Consider two processes:

  • $Y_t=0.2\times t+\epsilon_t$

  • $Y_t=\sum\limits_{i=1}^t\epsilon_i$

In both cases $\epsilon_t\overset{i.i.d}\sim N(0,\sigma^2)$

Example 1¶

For the first process

$$\begin{aligned}E(Y_t)&=E(0.2t+\epsilon_t)\\&=E(0.2t)+E(\epsilon_t)\\&=0.2t+0=0.2t\end{aligned}$$

Expected value depends on time. The series is nonstationary

Your turn: does the variance depend on time?

Example 2¶

For the second process:

$$\begin{aligned}E(Y_t)&=E\left(\sum\limits_{i=1}^t\epsilon_i\right)\\&=\sum\limits_{i=1}^tE(\epsilon_i)\\&=\sum\limits_{i=1}^t0=0\end{aligned}$$

Your turn: does the variance depend on time? Is this stationary or nonstationary?

Simulation: Example 1¶

In [1]:
import numpy as np
T=200                                      # length of each series
I=8                                        # number of simulated "multiverses"
y_ex1 = np.random.standard_normal((I,T)) 
trend = 0.2*(np.arange(0,T)+1)             # deterministic trend 0.2*t
for i in range(I):
    y_ex1[i,] = y_ex1[i,] + trend          # Y_t = 0.2*t + epsilon_t

y_ex1
Out[1]:
array([[ 1.99181191,  0.74873099,  0.76237771, ..., 40.66073836,
        40.0212313 , 41.0014318 ],
       [ 0.39736702,  1.91180765,  0.90328169, ..., 40.89142876,
        40.656141  , 38.17984535],
       [-1.98823551,  0.49968108,  2.86359707, ..., 39.80940666,
        39.47742127, 37.98607156],
       ...,
       [ 0.16828452,  0.24859632, -0.81146117, ..., 38.99058873,
        39.83864486, 41.35192786],
       [-2.06178154,  0.92476329,  0.06123239, ..., 41.06975236,
        38.93200457, 40.01663883],
       [-0.27183388, -1.55051855,  0.17579591, ..., 39.6672071 ,
        40.98419272, 41.60796254]])

Plots: Example 1¶

In [2]:
import matplotlib.pyplot as plt 
fig, ax = plt.subplots(figsize=(25, 6))
for i in range(I):
    ax.plot(range(1,T+1),y_ex1[i,])        # one realisation per multiverse
    ax.axhline(y=0,color='red')
    ax.axvline(x=T,color='black')
    ax.axvline(x=1,color='black')

Simulation: Example 2¶

In [3]:
y_ex2 = np.random.standard_normal((I,T)) 
y_ex2 = np.cumsum(y_ex2,1)                 # random walk: cumulative sum of the shocks

Plots: Example 2¶

In [4]:
fig, ax = plt.subplots(figsize=(25, 6))
for i in range(I):
    ax.plot(range(1,T+1),y_ex2[i,])        # one random walk per multiverse
    ax.axhline(y=0,color='red')
    ax.axvline(x=T,color='black')
    ax.axvline(x=1,color='black')
    

Differencing¶

  • It is much easier to handle stationary time series.
  • However, non-stationary series can often be made stationary via differencing.
  • Consider Example 2 above (a short code sketch follows the derivation)
$$\begin{aligned}Y_t-Y_{t-1}&=\sum\limits_{i=1}^t\epsilon_i - \sum\limits_{i=1}^{t-1}\epsilon_i \\&=(\epsilon_t+\epsilon_{t-1}+\dots+\epsilon_1)-(\epsilon_{t-1}+\dots+\epsilon_1)\\&=\epsilon_t \end{aligned}$$
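
A minimal sketch: differencing the random walk of Example 2 with np.diff recovers the underlying white noise (variable names mirror the earlier simulation):

import numpy as np

T = 200
epsilon = np.random.standard_normal(T)
y = np.cumsum(epsilon)                 # random walk, as in Example 2

dy = np.diff(y)                        # Y_t - Y_{t-1}
print(np.allclose(dy, epsilon[1:]))    # True: the differences are just the shocks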

Tests for stationarity¶

  • With only one realisation of raw data, how can we determine whether the time series is stationary?
  • Several hypothesis tests exist for this purpose
    • Augmented Dickey Fuller
    • KPSS Test
    • Phillips Perron
  • For ADF and PP the null is that the series is non-stationary (has a unit root); for KPSS the null is that it is stationary (see the sketch below).
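
A minimal sketch of running these tests with statsmodels (adfuller and kpss from statsmodels.tsa.stattools; the random walk is illustrative, and kpss may warn when the p-value lies outside its lookup table):

import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

y = np.cumsum(np.random.standard_normal(200))      # non-stationary random walk

# ADF: null hypothesis is a unit root (non-stationarity).
print("ADF p-value:", adfuller(y)[1])               # typically large: cannot reject non-stationarity
# KPSS: null hypothesis is stationarity.
print("KPSS p-value:", kpss(y, nlags="auto")[1])    # typically small: reject stationarity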

Moving Average model¶

An MA(q) model is given by:

$Y_t=\epsilon_t+\theta_1\epsilon_{t-1}+\theta_2\epsilon_{t-2}+\dots+\theta_q\epsilon_{t-q}$

Or using lag operators:

$Y_t=(1+\theta_1L+\theta_2L^2+\dots+\theta_qL^q)\epsilon_t$

Or a more compact shorthand:

$Y_t=(1+\Theta(L))\epsilon_t$

We can (and often do) add an intercept, but I will leave it off for simplicity.

Example¶

  • You run a restaurant whose sales are heavily influenced by vouchers issued by third parties.
  • You do not observe the number of vouchers issued so treat it as random.
  • The voucher may be valid for two days. There is
    • an instantaneous effect of vouchers issued on sales, and
    • a 'discount' factor for the effect of vouchers on the second day.
$$\textrm{Sales}_t=\textrm{Voucher Effect}_t+\theta\textrm{Voucher Effect}_{t-1}$$

Simulation¶

In [5]:
T=200
y=np.zeros(T)
epsilon=np.random.normal(0,0.1,T)
for i in range(1,T):
    y[i]=epsilon[i]+0.8*epsilon[i-1]   # MA(1) with theta = 0.8
fig, ax = plt.subplots(figsize=(25, 6))
ax.plot(np.arange(T),y)
Out[5]:
[<matplotlib.lines.Line2D at 0x7f90ddb552e0>]

ACF and PACF of MA¶

In [6]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, ax = plt.subplots(1,2,figsize=(15,6))
plot_acf(y,ax=ax[0])
plot_pacf(y,ax=ax[1])
plt.show()

ACF and PACF in theory¶

  • The ACF is zero beyond lag $q$ for an MA(q).
  • The PACF decays more slowly.
  • Remember that what we see in the plots are sample estimates. But even then
    • The sample ACF is not significantly different from zero after lag $q$.
    • The sample PACF still shows significant spikes after lag $q$.

Wold's Theorem¶

  • By Wold's Theorem, any stationary process can be represented as an MA($\infty$).
  • Even a process that is non-linear can be written as a linear combination of past errors.
    • Linear is good due to ease of estimation
    • Past errors are good since the model can be used for forecasting
  • The disadvantage is that a finite order MA is restrictive.
    • For example, the ACF is zero past lag $q$
  • Is there another simple model that has a non-zero ACF up to infinite order?

Autoregressive model¶

An AR(p) model is given by

$Y_t=\phi_1Y_{t-1}+\phi_2Y_{t-2}+\dots+\phi_pY_{t-p}+\epsilon_t$

Or using lag operators

$(1-\phi_1L-\phi_2L^2-\dots-\phi_pL^p)Y_t=\epsilon_t$

Or a more compact shorthand

$(1-\Phi(L))Y_t=\epsilon_t$

Example¶

  • Consider the restaurant's price for a bowl of noodle soup. This will be influenced by factors unobserved to the restaurant owner, e.g. the price of noodles, changes to rent, gas bills, etc.
  • However, the restaurant doesn't want to change the price dramatically, so the previous price will also have an important effect on the current price.
$$\textrm{Price}_t=\phi \textrm{Price}_{t-1}+\text{Other Shocks}_t$$
  • This is an AR(1) model.

Simulation¶

In [7]:
T=200
y=np.zeros(T)
epsilon=np.random.normal(0,0.1,T)
for i in range(1,T):
    y[i]=0.8*y[i-1]+epsilon[i]   # AR(1) with phi = 0.8
fig, ax = plt.subplots(figsize=(25, 6))
ax.plot(np.arange(T),y)
Out[7]:
[<matplotlib.lines.Line2D at 0x7f90b55afe20>]

ACF and PACF of AR¶

In [8]:
fig, ax = plt.subplots(1,2,figsize=(15,6))
plot_acf(y,ax=ax[0])
plot_pacf(y,ax=ax[1])
plt.show()

ACF and PACF in theory¶

  • The PACF is zero beyond lag $p$ for an AR(p).
  • The ACF decays more slowly.
  • Remember that what we see in the plots are sample estimates. But even then
    • The sample PACF is not significantly different from zero after lag $p$.
    • The sample ACF still shows significant spikes after lag $p$.

Looking for more flexibility¶

  • The AR model implies zeros past some point in the PACF
  • The MA model implies zeros past some point in the ACF
  • Is there a more flexible model that
    • Keeps a finite number of parameters
    • Does not constrain the ACF and PACF

ARMA model¶

An ARMA(p,q) model is given by:

$Y_t=\phi_1Y_{t-1}+\phi_2Y_{t-2}+\dots+\phi_pY_{t-p}+\epsilon_t+\theta_1\epsilon_{t-1}+\theta_2\epsilon_{t-2}+\dots+\theta_q\epsilon_{t-q}$

Or using lag operators:

$(1-\phi_1L-\phi_2L^2-\dots-\phi_pL^p)Y_t=(1+\theta_1L+\theta_2L^2+\dots+\theta_qL^q)\epsilon_t$

Or a more compact shorthand:

$(1-\Phi(L))Y_t=(1+\Theta(L))\epsilon_t$

  • Your turn: simulate an ARMA(1,1) and find its ACF and PACF

Simulation¶

In [9]:
T=200
y=np.zeros(T)
epsilon=np.random.normal(0,0.1,T)
for i in range(1,T):
    y[i]=0.8*y[i-1]+epsilon[i]+0.7*epsilon[i-1]   # ARMA(1,1) with phi = 0.8, theta = 0.7
fig, ax = plt.subplots(figsize=(25, 6))
ax.plot(np.arange(T),y)
Out[9]:
[<matplotlib.lines.Line2D at 0x7f90b5411be0>]

ACF and PACF of ARMA(1,1)¶

In [10]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, ax = plt.subplots(1,2,figsize=(15,6))
plot_acf(y,ax=ax[0])
plot_pacf(y,ax=ax[1])
plt.show()

Summarise¶

  • Any stationary time series can be written as an MA($\infty$).
  • ARMA models are a flexible class of
    • linear models with
    • a finite number of parameters
    • that can be easily estimated
    • and used for forecasting.
  • This is the theoretical justification for ARIMA models.

Putting the I in ARIMA¶

  • The above only applies to stationary series
  • However, we can difference the data $d$ times to make it stationary. The final ARIMA model is
$$(1-\Phi(L))(1-L)^d Y_t=(1+\Theta(L))\epsilon_t$$
  • We also write that a model is ARIMA(p,d,q) where (a fitting sketch follows this list)

    • $p$ is the number of AR terms
    • $d$ is the number of times we difference data
    • $q$ is the number of MA terms
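
A minimal fitting sketch with statsmodels' ARIMA class (the order (1, 1, 1) and the simulated random walk are purely illustrative, not a modelling recommendation):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

y = np.cumsum(np.random.standard_normal(200))   # illustrative non-stationary series

model = ARIMA(y, order=(1, 1, 1))   # p=1 AR term, d=1 difference, q=1 MA term
fit = model.fit()
print(fit.summary())
print(fit.forecast(steps=10))       # forecasts for the next 10 periods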

Box-Jenkins¶

The Box-Jenkins method has the following steps (a short code sketch follows the list):

  • Use unit root tests to determine whether a series is non-stationary.
    • If the series is non-stationary difference the series.
    • Keep going until the series is differenced enough to be stationary.
  • Inspect the ACF and PACF.
    • If the ACF is zero after some lag $q$ use an MA(q)
    • If the PACF is zero after some lag $p$ use an AR(p)
    • Otherwise use an ARMA model
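
A short sketch of these steps chained together (assumptions: statsmodels' adfuller as the unit-root test and the plot_acf/plot_pacf helpers used earlier; the random-walk data are illustrative):

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

y = np.cumsum(np.random.standard_normal(300))   # illustrative series

# Step 1: difference until the ADF test rejects the unit-root null (capped at 3 differences).
d = 0
while adfuller(y)[1] > 0.05 and d < 3:
    y = np.diff(y)
    d += 1
print("number of differences:", d)

# Step 2: inspect the ACF and PACF of the differenced series to pick p and q.
fig, ax = plt.subplots(1, 2, figsize=(15, 6))
plot_acf(y, ax=ax[0])
plot_pacf(y, ax=ax[1])
plt.show()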

Problems with Box-Jenkins¶

  • Looking at ACF and PACF plots is subjective.
  • It is difficult to apply to a large number of time series.
  • There are multiple testing problems.
  • The power of the tests can be low.
  • There is no clear guide for selecting the ARMA order.