Differencing in Time Series Analysis
Definition 1
- Define operator $B$ as $B Y_{t} = Y_{t-1}$, referred to as Backshift.
- Define operator $\nabla$ as $\nabla := 1 - B$, with higher orders given recursively by $\nabla^{r+1} Y_{t} := \nabla \left( \nabla^{r} Y_{t} \right)$, referred to as Differencing.
Explanation
According to the definition of differencing, the first difference is calculated as $$ \nabla Y_{t} = Y_{t} - Y_{t-1} $$ and the second difference is calculated as $$ \begin{align*} \nabla^2 Y_{t} =& \nabla \left( \nabla Y_{t} \right) \\ =& \nabla \left( Y_{t} - Y_{t-1} \right) \\ =& \nabla Y_{t} - \nabla Y_{t-1} \\ =& ( Y_{t} - Y_{t-1} ) - ( Y_{t-1} - Y_{t-2} ) \\ =& Y_{t} - 2 Y_{t-1} + Y_{t-2} \end{align*} $$ In other words, applying differencing twice to $Y_{t}$ does not result in $Y_{t} - Y_{t-2}$. A difference across a longer lag $s$, of the form $Y_{t} - Y_{t-s}$, is defined separately as seasonal differencing.
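The derivation above can be checked numerically. The following is a minimal base R sketch on a made-up vector `y`; the values and the helper name `backshift` are illustrative assumptions, not part of the source. It applies $\nabla = 1 - B$ twice and compares the result with the expanded formula and with $Y_{t} - Y_{t-2}$.

```r
# Toy series; the values are arbitrary and chosen only for illustration.
y <- c(1, 4, 9, 16, 25, 36)

# Hypothetical helper: backshift B, so that (B y)[t] = y[t-1].
# The first element has no predecessor, so it becomes NA.
backshift <- function(x) c(NA, x[-length(x)])

# First difference: (1 - B) y
d1 <- y - backshift(y)

# Second difference: apply (1 - B) again to the first difference
d2 <- d1 - backshift(d1)

# Expanded form: Y_t - 2 Y_{t-1} + Y_{t-2}
expanded <- y - 2 * backshift(y) + backshift(backshift(y))

# Lag-2 difference: Y_t - Y_{t-2}, which is a different quantity
lag2 <- y - backshift(backshift(y))

print(d2)        # values: NA, NA, 2, 2, 2, 2
print(expanded)  # values: NA, NA, 2, 2, 2, 2
print(lag2)      # values: NA, NA, 8, 12, 16, 20
```

The first two entries are `NA` because $Y_{t-1}$ and $Y_{t-2}$ do not exist at the start of the series.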
Differencing is needed in time series analysis because it is a convenient way to handle data that has a Trend. In time series analysis, a trend refers to the "tendency of data values to increase or decrease over a certain period". A series with a trend has a mean that changes over time, so it is not stationary. Therefore, appropriate differencing is carried out as a preprocessing step to make the data stationary. A single difference may be sufficient for a simple increase or decrease, but more complex shapes may require differencing more than once.
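How the order of differencing relates to the shape of the trend can be sketched with noise-free toy series. The coefficients and variable names below are arbitrary assumptions chosen for illustration.

```r
# Deterministic toy trends (no noise), chosen only for illustration.
time_idx  <- 1:10
linear    <- 3 + 2 * time_idx        # linear trend with slope 2
quadratic <- 1 + 0.5 * time_idx^2    # quadratic trend

# One difference turns a linear trend into a constant (its slope).
d_linear <- diff(linear)

# One difference of a quadratic trend is still a linear trend ...
d_quad1 <- diff(quadratic)

# ... but a second difference makes it constant.
d_quad2 <- diff(quadratic, differences = 2)

print(d_linear)  # every value is 2
print(d_quad1)   # 1.5, 2.5, ..., 9.5: still increasing
print(d_quad2)   # every value is 1
```

With real data the differenced series is not exactly constant, but the same idea applies: differencing continues until no systematic trend remains.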
When it is unclear whether differencing is appropriate or how many times to apply it, the Dickey-Fuller test is commonly used, both to support the chosen order of differencing and to justify why no further differencing is necessary.
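The idea behind the test can be sketched in base R as a regression of the differenced series on its lagged level. This is only a simplified illustration under stated assumptions: the data are a simulated random walk, the helper `df_stat` is hypothetical, no extra lagged differences are included, and $-2.86$ is used as an approximate asymptotic 5% critical value for the case with an intercept and no trend. In practice a packaged implementation such as `adf.test()` from the tseries package would normally be used instead.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Simulated random walk: non-stationary by construction.
rw <- cumsum(rnorm(200))

# Hypothetical helper: Dickey-Fuller t-statistic from the regression
#   diff(y)_t = alpha + gamma * y_{t-1} + error
# A strongly negative statistic is evidence against a unit root.
df_stat <- function(y) {
  dy   <- diff(y)
  ylag <- y[-length(y)]
  fit  <- lm(dy ~ ylag)
  summary(fit)$coefficients["ylag", "t value"]
}

# Approximate asymptotic 5% critical value (intercept, no trend); an assumption.
crit_5pct <- -2.86

stat_level <- df_stat(rw)        # statistic for the original series
stat_diff  <- df_stat(diff(rw))  # statistic after one difference

# TRUE means the unit-root hypothesis is rejected at roughly the 5% level.
print(c(level = stat_level < crit_5pct, differenced = stat_diff < crit_5pct))
```

If the statistic for the original series does not fall below the critical value while the statistic for the differenced series does, one difference is supported and a second one is not needed.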
Practice
Let’s look at the oil.price data from the TSA package. oil.price is data on crude oil prices from 1986 to 2005. There is a strong upward trend, indicating a lack of stationarity. Such data is difficult to analyze, so differencing is applied to remove the trend.
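Differencing in R is very easy. The diff() function returns the differenced data, which is one observation shorter than the original because the first value has no predecessor. The option differences=n computes the $n$th difference $\nabla^{n} Y_{t}$, while the less often used option lag=n computes the lag-$n$ difference $Y_{t} - Y_{t-n}$, which is the form used for seasonal differencing.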
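The behaviour of the `lag` and `differences` arguments of base R's `diff()` can be seen on a small vector. The values and variable names below are arbitrary assumptions for illustration.

```r
y <- c(10, 12, 15, 19, 24, 30, 37)  # arbitrary toy values

first_diff  <- diff(y)                   # y[t] - y[t-1]
second_diff <- diff(y, differences = 2)  # difference of the first difference
lag2_diff   <- diff(y, lag = 2)          # y[t] - y[t-2]

print(first_diff)   # values: 2, 3, 4, 5, 6, 7
print(second_diff)  # values: 1, 1, 1, 1, 1
print(lag2_diff)    # values: 5, 7, 9, 11, 13

# Each result is shorter than the input because the leading values
# have no earlier observation to subtract.
print(length(y) - length(first_diff))   # 1
print(length(y) - length(second_diff))  # 2
print(length(y) - length(lag2_diff))    # 2
```

Note that `differences = 2` and `lag = 2` return different values even though both drop two observations.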
Despite the fluctuations, the differenced result shows movement around $0$ on average.
Code
Below is example R code.
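library(TSA)
data(oil.price); oil.price
# dev.new() is used instead of the Windows-only win.graph()
dev.new(width=4,height=4); plot(oil.price,main='oil.price')
# First difference: Y_t - Y_{t-1}
diff(oil.price)
dev.new(width=4,height=4); plot(diff(oil.price),main='∇oil.price')
# Lag-2 difference: Y_t - Y_{t-2} (not the second difference)
diff(oil.price,lag=2)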
See Also
Cryer, J. D., & Chan, K.-S. (2008). Time Series Analysis: With Applications in R (2nd Edition): p90.