Autocorrelation Function
Definition 1
Let’s say $\left\{ Y_{t} \right\}_{t=1}^{n}$ is a stochastic process.
- $\mu_{t} := E ( Y_{t} )$ is called the mean function.
- The following defined $\gamma_{ t , s }$ is called the autocovariance function. $$ \gamma_{t , s} : = \text{cov} ( Y_{t} , Y_{s} ) = E \left[ ( Y_{t} - \mu_{t} ) ( Y_{s} - \mu_{s} ) \right] $$
- The following defined $\rho_{ t , s }$ is called the autocorrelation function. $$ \rho_{ t , s } := \text{cor} ( Y_{t} , Y_{s} ) = {{ \gamma_{t , s} } \over { \sqrt{ \gamma_{t , t} \gamma_{s , s} } }} $$
- The following defined $\rho_{ k }$ is called the autocorrelation function for lag $k$. $$ \rho_{ k } := \text{cor} ( Y_{t} , Y_{t-k} ) = {{ \gamma_{t , t - k} } \over { \sqrt{ \gamma_{t , t} \gamma_{t-k , t-k} } }} $$
- The following defined $r_{ k }$ is called the sample autocorrelation function for lag $k$. $$ r_{ k } := {{ \sum_{t = k+1}^{n} \left( Y_{t} - \overline{Y} \right) \left( Y_{t-k} - \overline{Y} \right) } \over { \sum_{t=1}^{n} \left( Y_{t} - \overline{Y} \right)^2 }} $$
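As a quick sanity check on the last definition, the sample autocorrelation can be computed directly from the formula and compared with what R reports. This is only a sketch: the function name sample_acf and the simulated series y are arbitrary choices for illustration, not part of the definition.

# Sample autocorrelation at lag k, computed directly from the definition of r_k
sample_acf <- function(y, k) {
  n <- length(y)
  ybar <- mean(y)
  sum((y[(k+1):n] - ybar) * (y[1:(n-k)] - ybar)) / sum((y - ybar)^2)
}
set.seed(1)
y <- rnorm(100)             # arbitrary example series
sample_acf(y, 1)            # r_1 from the formula above
acf(y, plot = FALSE)$acf[2] # r_1 as reported by acf(); index 1 is lag 0

The two values agree because acf() uses the same divisor in the numerator and denominator, so it cancels just as in the formula for $r_{k}$.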
Explanation
The autocorrelation function is used to understand the autocorrelation of time series data, that is, how similar the series is to itself at a given lag. Whereas regression analysis is interested in the correlation between different variables, here a single series is split at lag $k$ into $Y_{t}$ and $Y_{t-k}$, and these are treated like two variables.
Mathematical Explanation
Mathematically, if we assume $Y_{t}$ comes from an $MA(q)$ model, then since $\displaystyle Y_{t} = e_{t} - \sum_{k=1}^{q} \theta_{k} e_{t-k}$, $Y_{t}$ can be viewed as a linear combination of normally distributed white noise terms, and its autocorrelations are determined by the coefficients $\theta_{1} , \cdots , \theta_{q}$: $$ \rho_{k} = {{ - \theta_{k} + \sum_{j=1}^{q-k} \theta_{j} \theta_{j+k} } \over { 1 + \sum_{j=1}^{q} \theta_{j}^{2} }} \quad \text{for } k = 1 , \cdots , q , \qquad \rho_{k} = 0 \quad \text{for } k > q $$ Since the ACF cuts off after lag $q$, it is useful for identifying the $MA(q)$ model.
The sACF $r_{k}$ is an estimate of the ACF $\rho_{k}$, and if $Y_{t}$ comes from an $MA(q)$ model, then for $k > q$ it approximately follows the normal distribution $\displaystyle N \left( \rho_{k} , {{1} \over {n}} \left[ 1 + 2 \sum_{j=1}^{q} \rho_{j}^{2} \right] \right)$. Expressed mathematically, it is $$ r_{k} \sim N \left( \rho_{k} , {{1} \over {n}} \left[ 1 + 2 \sum_{j=1}^{q} \rho_{j}^{2} \right] \right) $$ which is used for hypothesis testing.
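This approximation can be checked with a small simulation. The sketch below is only illustrative: the choices of $n$, $\theta$, and the number of replications are arbitrary, and arima.sim() takes its ma coefficients with the opposite sign convention to the one used in this article.

# Sketch: compare the empirical spread of r_2 for an MA(1) series with the approximation above
set.seed(42)
n <- 200; theta <- 0.9
rho1 <- -theta / (1 + theta^2)                      # true lag-1 ACF of this MA(1)
r2 <- replicate(2000, {
  y <- arima.sim(model = list(ma = -theta), n = n)  # Y_t = e_t - theta * e_{t-1}
  acf(y, plot = FALSE)$acf[3]                       # r_2; index 1 is lag 0
})
sd(r2)                       # empirical standard error of r_2
sqrt((1 + 2 * rho1^2) / n)   # square root of the approximate variance (k > q = 1)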
Tests
Given $\displaystyle Y_{t} = e_{t} - \sum_{k=1}^{q} \theta_{k} e_{t-k}$, for each lag $k = 1 , \cdots , q$:
- $H_{0}$: $MA(0) \iff \theta_{k} = 0$, that is, $Y_{t}$ does not follow a moving average model.
- $H_{1}$: $MA(k) \iff \theta_{k} \ne 0$, that is, $Y_{t}$ has autocorrelation at lag $k$.
Interpretation
Under the null hypothesis, $\rho_{k} = 0$ for all $k \ge 1$, so with $q = 0$ we have $\displaystyle r_{k} \sim N \left( 0 , {{1} \over {n }} \right)$, and the standard error becomes $\displaystyle {{1} \over {\sqrt{n} }}$. Therefore, to conduct a hypothesis test at significance level $\alpha$, check whether $| r_{k} |$ exceeds the critical value $\displaystyle {{ z_{1 - \alpha/2} } \over { \sqrt{n} }}$. If it exceeds it, lag $k$ becomes a candidate for a significant lag; if not, there is considered to be no autocorrelation at that lag.
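In code, the whole test reduces to one comparison per lag. A minimal sketch, with an arbitrary simulated series and $\alpha = 0.05$:

# Sketch: flag lags whose sample ACF exceeds the white-noise bound z_{1-alpha/2}/sqrt(n)
set.seed(7)
y <- arima.sim(model = list(ma = -0.9), n = 120)  # arbitrary MA(1) example series
alpha <- 0.05
r <- acf(y, plot = FALSE)$acf[-1]                 # r_1, r_2, ... (drop lag 0)
bound <- qnorm(1 - alpha / 2) / sqrt(length(y))   # critical value z_{1-alpha/2}/sqrt(n)
which(abs(r) > bound)                             # candidate significant lags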
Practice
The ma1.2.s data is a sample dataset from the TSA package, generated from an $MA(1)$ model. When fitting an actual ARIMA model, the significance of a coefficient is likewise judged by whether the absolute value of its estimate exceeds twice its standard error.
Using the acf() function of the TSA package produces a correlogram across various lags $k$, like the one above. There is no need to compute anything by hand: if a bar exceeds the line, it is considered significant; if not, it is not. The bounds are drawn at a default significance level of $5 \%$.
Note that although the bar at $k=6$ slightly exceeds the line and is therefore statistically significant, it is not regarded as genuine autocorrelation. Such marginal exceedances are very common in time series analysis, and for your own sanity it is best to be flexible and accept them as they are.
Drawing the line yourself, as shown above, is a recommended way to check whether you have properly understood hypothesis testing with the autocorrelation function. It takes just one line of R code, and by running it at least once yourself you can accept, without any complicated derivation, that $r_{k}$ follows a normal distribution and that its standard error is $\displaystyle \text{se} ( r_{k} ) = {{1} \over {\sqrt{n}}}$.
Code
library(TSA)
# Load the MA(1) sample data, open a plot window, and draw the correlogram
data(ma1.2.s); win.graph(6,4); acf(ma1.2.s)
# Fit an MA(1) model, i.e. ARIMA(0,0,1), to compare coefficient estimates with their standard errors
arima(ma1.2.s, order=c(0,0,1))
# Draw the upper 5% significance bound 1.96/sqrt(n) on the correlogram
abline(h=1.96*1/sqrt(length(ma1.2.s)),col='red')
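If you also want the symmetric lower bound drawn on the same correlogram, one more line (an optional addition, not part of the snippet above) does it:

# Draw the lower 5% significance bound as well
abline(h=-1.96*1/sqrt(length(ma1.2.s)),col='red')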
See Also
- PACF: Partial Autocorrelation Function
- EACF: Extended Autocorrelation Function
- CCF: Cross-Correlation Function
Cryer, J. D., & Chan, K.-S. (2008). Time Series Analysis: With Applications in R (2nd Edition): p11, 109. ↩︎