Box-Cox Transformation

Buildup

For $x > 0$, the function $g$ defined below is called the Box-Cox transformation.

$$g(x) := \begin{cases} \displaystyle {{ x^{\lambda} - 1 } \over { \lambda }} & , \lambda \ne 0 \\ \log x & , \lambda = 0 \end{cases}$$
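As a minimal sketch, the definition can be written as a small R function. The name `box_cox` is a hypothetical helper chosen for illustration, not a function from any package.

```r
# Box-Cox transformation of a positive numeric vector x for a given lambda.
# `box_cox` is an illustrative helper name, not part of base R or TSA.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))            # the transformation is defined only for x > 0
  if (lambda == 0) {
    log(x)                         # the lambda = 0 branch of the definition
  } else {
    (x^lambda - 1) / lambda        # the lambda != 0 branch of the definition
  }
}

box_cox(c(1, 2, 4), lambda = 0.5)
box_cox(c(1, 2, 4), lambda = 0)
```

Note that $x = 1$ is mapped to $0$ for every $\lambda$, which follows directly from the definition.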

$g$ belongs to the family of power transformations; it was introduced by Box and Cox, hence the name Box-Cox transformation. Its main uses are to make data look more normally distributed and to stabilize the variance, which makes it a useful preprocessing step before analysis techniques that assume normality or require stationarity. A power transformation requires the data to be positive, but this is usually handled by adding a constant so that the minimum becomes positive. If that workaround is unsatisfactory, one may consider the Yeo-Johnson transformation, a generalization defined on the whole real line.
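The shifting trick mentioned above might look like the following sketch. The sample values and the choice of making the new minimum equal to 1 are assumptions for illustration; any shift that makes every value positive would do.

```r
# Toy data containing non-positive values (made up for illustration)
x <- c(-3.2, 0, 1.5, 4.8)

# Assumed choice: shift so that the smallest value becomes 1
shift <- 1 - min(x)
x_pos <- x + shift

min(x_pos)     # smallest value after the shift
log(x_pos)     # the lambda = 0 transformation is now well defined
```

The size of the shift is itself a modeling choice and can affect which $\lambda$ looks best afterwards.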

Mathematically, since $\displaystyle \lim_{\lambda \to 0} {{ x^{\lambda} - 1 } \over { \lambda }} = \log x$, it is usually enough to remember only $\displaystyle g(x) = {{ x^{\lambda} - 1 } \over { \lambda }}$. Also, $g$ depends on $\lambda$, so what we really have is a family of functions $\left\{ g_{\lambda} : \lambda \in \mathbb{R} \right\}$. The shape of $g_{\lambda}$ changes flexibly with $\lambda$, and the analyst must choose $\lambda$ to suit the objective.
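The limit quoted above can be checked by writing the power as an exponential and recognizing a derivative at $\lambda = 0$. A short derivation, assuming $x > 0$ and the natural logarithm:

```latex
\begin{align*}
\lim_{\lambda \to 0} \frac{x^{\lambda} - 1}{\lambda}
  &= \lim_{\lambda \to 0} \frac{e^{\lambda \log x} - 1}{\lambda} \\
  &= \left. \frac{d}{d\lambda} e^{\lambda \log x} \right|_{\lambda = 0} \\
  &= \log x
\end{align*}
```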

Particularly noteworthy are the cases $\lambda = 0, 1/2, 1$. $g_{0} (x) = \log(x)$ is the most frequently used transformation. For $\lambda = 1/2$ we get $\displaystyle g_{1/2} (x) = {{\sqrt{x} - 1} \over {1/2}} = 2 \left( \sqrt{x} - 1 \right)$, and since $h(x) = 2 (x - 1)$ is just a linear transformation, $g_{1/2}$ is essentially the square root $\sqrt{x}$. That the two favorite transformations, the square root $\sqrt{\,}$ and the logarithm $\log$, are both covered by the Box-Cox transformation is interesting in theory and convenient in practice.

In the case of $\lambda = 1$, it becomes $g(x) = x - 1$, which is effectively the identity transformation: subtracting $1$ from every observation is hardly a transformation at all. If the suitable $\lambda$ for the given data turns out to be $1$, this can be interpreted as 'no transformation necessary'.

Accordingly, in time series analysis, computing a confidence interval for $\lambda$ in the Box-Cox transformation can be viewed as a hypothesis test of whether the variance of the data is constant. If the confidence interval for $\lambda$ includes $1$, there is no significant difference between transforming the data and leaving it alone, and needing no transformation suggests that the variance is already constant.

Hypothesis Testing

Let’s assume we have data $\left\{ x_{t} \right\}$.

  • $H_{0}$: $\lambda = 1$, that is, the data $\left\{ x_{t} \right\}$ is stationary.
  • $H_{1}$: $\lambda \ne 1$, that is, the data $\left\{ x_{t} \right\}$ is not stationary.

Note that this diagnostic concerns the variance only. It says nothing about the mean, so stationarity in the mean requires a separate test.
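To illustrate how a confidence interval for $\lambda$ can turn into a test of $H_{0}$, here is a rough sketch based on the profile likelihood. It assumes the transformed data are i.i.d. normal, which is simpler than the autoregressive model a time series tool would fit, and all function names (`box_cox`, `profile_loglik`, `lambda_ci`) are hypothetical helpers. The data are simulated, not real.

```r
# Box-Cox transform for a fixed lambda (x must be positive)
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Profile log-likelihood of lambda, assuming g(x) is i.i.d. normal
profile_loglik <- function(x, lambda) {
  y <- box_cox(x, lambda)
  n <- length(x)
  sigma2 <- mean((y - mean(y))^2)              # MLE of the variance
  -n / 2 * log(sigma2) + (lambda - 1) * sum(log(x))
}

# Approximate interval: every lambda whose profile log-likelihood lies
# within the chi-square cutoff of the maximum over the grid
lambda_ci <- function(x, grid = seq(-2, 2, by = 0.01), level = 0.95) {
  ll <- sapply(grid, function(l) profile_loglik(x, l))
  keep <- grid[ll >= max(ll) - qchisq(level, df = 1) / 2]
  list(mle = grid[which.max(ll)], ci = range(keep))
}

set.seed(1)
x <- exp(rnorm(200, mean = 2, sd = 0.5))       # simulated log-normal data
res <- lambda_ci(x)
res$mle
res$ci

# H0: lambda = 1 is rejected at the 5% level when 1 lies outside the interval
res$ci[1] <= 1 && 1 <= res$ci[2]
```

Because the simulated data are log-normal, the interval would be expected to sit near $0$ and away from $1$, mirroring the reasoning in the hypotheses above.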

Code

Practice

In R, the BoxCox.ar() function from the TSA package allows for easy hypothesis testing.

Let’s load the built-in data UKgas.

[Figures: ukgas.png (time plot of UKgas), boxcoxukgas.png (BoxCox.ar() output for UKgas)]

UKgas records the quarterly gas consumption in the UK, and as the plot shows, the fluctuation grows over the years. Meanwhile, the confidence interval from the test includes $0$, so the Box-Cox transformation effectively reduces to a log transformation.

[Figures: logukgas.png (time plot of log(UKgas)), boxcoxlogukgas.png (BoxCox.ar() output for log(UKgas))]

Taking the log does stabilize the variance considerably. Running the test once more on the logged series shows that $1$ is included in the confidence interval, which can be read as: at the $95\%$ confidence level, no further transformation is necessary. However, as the graphs show, the variance is not completely stabilized, and since $2$ is also included in the confidence interval, one could reasonably decide to transform once more. In such a situation the choice ultimately depends on the confidence level, and at $95 \%$ it is left to the analyst's judgment.
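The intervals discussed here can also be inspected numerically rather than read off the plot. The sketch below assumes that the list returned by BoxCox.ar() carries components named `mle` and `ci`, as described in the TSA documentation; check `?BoxCox.ar` for the installed version. The helper `in_ci` is a hypothetical name introduced for illustration.

```r
# Hypothetical helper: is `value` inside the interval ci = c(lower, upper)?
in_ci <- function(ci, value) ci[1] <= value && value <= ci[2]

if (requireNamespace("TSA", quietly = TRUE)) {
  # Assumption: the returned list has components `mle` and `ci`
  bc_raw <- TSA::BoxCox.ar(UKgas)
  bc_log <- TSA::BoxCox.ar(log(UKgas))

  print(bc_raw$mle)
  print(bc_raw$ci)
  print(in_ci(bc_raw$ci, 0))   # does the interval for UKgas contain 0?
  print(in_ci(bc_log$ci, 1))   # does the interval for log(UKgas) contain 1?
}
```

Wrapping the calls in `requireNamespace()` simply lets the snippet run without error when TSA is not installed.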

Full Code

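library(TSA)   # provides BoxCox.ar()
UKgas          # built-in quarterly UK gas consumption series
# win.graph() exists only on Windows; elsewhere use dev.new(width=3.5, height=3.5)
win.graph(3.5,3.5); plot(UKgas,main='UKgas')
win.graph(3.5,3.5); BoxCox.ar(UKgas)
win.graph(3.5,3.5); plot(log(UKgas),main='log(UKgas)')
win.graph(3.5,3.5); BoxCox.ar(log(UKgas))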