

Sufficient Statistic

Definitions

Mathematical Definition [1]

Let $f(x; \theta)$ be the probability mass/density function of a random sample $X_{1}, \cdots, X_{n}$ for the parameter $\theta \in \Theta$, and let $f_{Y_{1}}(y_{1}; \theta)$ be the probability mass/density function of the statistic $Y_{1} := u_{1}\left( X_{1}, \cdots, X_{n} \right)$.

If there exists a function $H\left( x_{1}, \cdots, x_{n} \right)$ that does not depend on $\theta \in \Theta$ such that
$$ \frac{f\left( x_{1}; \theta \right) \cdots f\left( x_{n}; \theta \right)}{f_{Y_{1}}\left( u_{1}\left( x_{1}, \cdots, x_{n} \right); \theta \right)} = H\left( x_{1}, \cdots, x_{n} \right) $$
then $Y_{1}$ is called a sufficient statistic for $\theta$.

General Definition [2]

A statistic $T(\mathbf{X})$ is called a sufficient statistic for the parameter $\theta$ if the conditional distribution of the sample $\mathbf{X}$ given the value of $T(\mathbf{X})$ does not depend on $\theta$.

Explanation

What the definitions mathematically imply is, intuitively, that $\theta$ cancels out of the numerator and the denominator: the sufficient statistic $Y_{1}$ captures exactly the information about $\theta$ contained in the random sample $X_{1}, \cdots, X_{n}$. The term 'sufficient' indicates that the statistic provides the information about $\theta$ 'sufficiently'; once the sufficient statistic is accounted for, no information about $\theta$ remains in the sample.
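To see this cancellation concretely, here is a minimal numerical sketch of Definition [1], borrowing the normal model that appears in the example below: with $\sigma$ known and $Y_{1} = \sum_{k} X_{k} \sim N(n\mu, n\sigma^{2})$, the ratio of the joint density to the density of $Y_{1}$ should come out the same for every $\mu$. The particular data, seed, and use of NumPy/SciPy are illustrative assumptions, not part of the original argument.

```python
# Minimal sketch: the ratio in Definition [1] does not depend on the parameter.
# Assumed model (for illustration): X_1, ..., X_n ~ N(mu, sigma^2), sigma known,
# so Y_1 = sum(X_k) ~ N(n*mu, n*sigma^2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma = 1.0
x = rng.normal(loc=3.0, scale=sigma, size=5)   # one fixed realization
n, t = len(x), x.sum()

for mu in [-2.0, 0.0, 1.5, 3.0, 10.0]:
    joint = norm.pdf(x, loc=mu, scale=sigma).prod()           # f(x_1; mu) ... f(x_n; mu)
    f_y1 = norm.pdf(t, loc=n * mu, scale=sigma * np.sqrt(n))  # f_{Y_1}(sum x_k; mu)
    print(f"mu = {mu:5.1f}   ratio = {joint / f_y1:.6e}")
# The printed ratio is identical for every mu: it is H(x_1, ..., x_n).
```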

To understand the concept of a sufficient statistic, let us refer to the following theorem.

Neyman Factorization Theorem: Let a random sample $X_{1}, \cdots, X_{n}$ have the same probability mass/density function $f(x; \theta)$ with respect to the parameter $\theta \in \Theta$. A statistic $Y_{1} = u_{1}\left( X_{1}, \cdots, X_{n} \right)$ is a sufficient statistic for $\theta$ if and only if there exist two non-negative functions $k_{1}, k_{2} \ge 0$ satisfying
$$ f\left( x_{1}; \theta \right) \cdots f\left( x_{n}; \theta \right) = k_{1}\left[ u_{1}\left( x_{1}, \cdots, x_{n} \right); \theta \right] k_{2}\left( x_{1}, \cdots, x_{n} \right) $$
Note that $k_{2}$ must not depend on $\theta$.

Unintuitive Example

$$ X_{1}, \cdots, X_{n} \sim N\left( \mu, \sigma^{2} \right) $$

Empirically, understanding sufficient statistics usually begins with grasping why such things are computed in the first place. A classic unintuitive example is the sufficient statistic for the population mean $\mu$ of a normal distribution $N\left( \mu, \sigma^{2} \right)$. By the factorization theorem,
$$ \begin{align*} \prod_{k=1}^{n} f\left( x_{k}; \mu \right) =& \prod_{k=1}^{n} \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{\left( x_{k} - \mu \right)^{2}}{2\sigma^{2}} \right) \\ =& \left( \frac{1}{\sigma \sqrt{2\pi}} \right)^{n} \exp\left( -\sum_{k=1}^{n} \frac{\left( x_{k} - \mu \right)^{2}}{2\sigma^{2}} \right) \\ =& \left( \frac{1}{\sigma \sqrt{2\pi}} \right)^{n} \exp\left( -\sum_{k=1}^{n} \frac{x_{k}^{2}}{2\sigma^{2}} \right) \exp\left( \frac{2\mu \sum_{k=1}^{n} x_{k} - n\mu^{2}}{2\sigma^{2}} \right) \\ =& k_{2}\left( x_{1}, \cdots, x_{n} \right) \cdot k_{1}\left[ u_{1}\left( x_{1}, \cdots, x_{n} \right); \mu \right] \end{align*} $$
so a sufficient statistic for $\mu$ is the sum of the sample $\sum_{k=1}^{n} X_{k}$, or equivalently the sample mean $\overline{X}$ obtained by dividing that sum by $n$. It is reassuring that, just as intuition suggests, the sample mean emerges, being an unbiased estimator, a consistent estimator, and the maximum likelihood estimator for $\mu$. Still, even though the mathematics checks out, it may feel quite abstract.
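The factorization can also be checked numerically: since the data enter $k_{1}$ only through $\sum_{k} x_{k}$, two different samples with the same sum must give likelihood curves that differ only by the constant factor $k_{2}(x)/k_{2}(y)$. The sketch below is only an illustration; the sample values and the use of SciPy are my own choices, not the text's.

```python
# Sketch: for N(mu, sigma^2) with sigma known, the likelihood depends on the
# data only through sum(x_k), so samples sharing that sum give a
# mu-independent likelihood ratio.
import numpy as np
from scipy.stats import norm

sigma = 1.0
x = np.array([0.2, 1.1, 2.7])   # illustrative data, sum = 4.0
y = np.array([1.0, 1.5, 1.5])   # different data, same sum = 4.0

for mu in [0.0, 0.5, 1.0, 2.0]:
    lx = norm.pdf(x, loc=mu, scale=sigma).prod()
    ly = norm.pdf(y, loc=mu, scale=sigma).prod()
    print(f"mu = {mu:3.1f}   L(mu; x) / L(mu; y) = {lx / ly:.6f}")
# The ratio is the same for every mu: it equals k_2(x) / k_2(y).
```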

Intuitive Example

$$ X_{1}, \cdots, X_{n} \sim U(0, \theta) \text{ with } f\left( x; \theta \right) = \begin{cases} \dfrac{1}{\theta} &, \text{if } x \in (0, \theta) \\ 0 &, \text{otherwise} \end{cases} = \frac{1}{\theta} I_{(0,\theta)}(x) $$

Consider, for example, a random sample from a uniform distribution $U(0, \theta)$ whose parameter $\theta$ is the maximum of its support. Suppose the actual realization is
$$ \begin{bmatrix} 2.3 \\ 1.2 \\ 1.7 \\ 0.1 \\ 0.1 \end{bmatrix} $$
and no more samples can be obtained. Since the true mean of a uniform distribution $U(a, b)$ is $\frac{b+a}{2}$, we could consider the following estimator:
$$ \frac{\hat{\theta} + 0}{2} = \frac{\sum_{k} x_{k}}{n} \implies \hat{\theta} \overset{?}{=} \frac{2 \sum_{k} x_{k}}{n} $$
This seems like a statistically reasonable guess. Indeed, doubling the calculated sample mean gives $2.16$, which looks plausible. However, since $2.3$ is in the sample, $\theta = 2.16$ is impossible. Whatever else is true, $\theta$ must be at least as large as $2.3$, and intuitively a rational estimate for $\theta$ is simply $\hat{\theta} = 2.3$; the current sample gives no reason to think it is any larger than $2.3$. Let us now actually find the sufficient statistic.
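For the record, the arithmetic above can be reproduced in a couple of lines (NumPy is used purely for convenience):

```python
import numpy as np

x = np.array([2.3, 1.2, 1.7, 0.1, 0.1])  # the realization quoted above
print(2 * x.mean())  # 2.16 -- the doubled sample mean
print(x.max())       # 2.3  -- already larger, so theta = 2.16 is impossible
```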

Product of Indicator Functions:
$$ \prod_{i=1}^{n} I_{(-\infty, \theta]}\left( x_{i} \right) = I_{(-\infty, \theta]}\left( \max_{i \in [n]} x_{i} \right) $$

Given this lemma and the factorization theorem,
$$ \begin{align*} \prod_{k=1}^{n} f\left( x_{k}; \theta \right) =& \prod_{k=1}^{n} \frac{1}{\theta} I_{(0,\theta)}\left( x_{k} \right) \\ =& \frac{1}{\theta^{n}} I_{(0,\theta)}\left( \max_{k} x_{k} \right) \cdot 1 \\ =& k_{1}\left[ u_{1}\left( x_{1}, \cdots, x_{n} \right); \theta \right] k_{2}\left( x_{1}, \cdots, x_{n} \right) \end{align*} $$
so the sample maximum $\max_{k} X_{k} = X_{(n)}$ is a sufficient statistic for $\theta$. This simply means that, as far as information about $\theta$ is concerned, the other observations are unnecessary; looking only at $\max_{k} X_{k}$ is 'sufficient'.
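As a numerical companion to this computation, the likelihood of a $U(0, \theta)$ sample is $\theta^{-n} I\left( \max_{k} x_{k} < \theta \right)$, so any two samples of the same size with the same maximum yield exactly the same likelihood function. The helper function and the second sample below are invented purely for illustration.

```python
# Sketch: the U(0, theta) likelihood depends on the data only through max(x).
import numpy as np

def likelihood(theta, x):
    """Joint density of an i.i.d. U(0, theta) sample evaluated at x."""
    x = np.asarray(x)
    if x.min() <= 0 or x.max() >= theta:
        return 0.0
    return theta ** (-len(x))

x = [2.3, 1.2, 1.7, 0.1, 0.1]   # the realization from the text
y = [2.3, 0.5, 0.5, 0.5, 0.5]   # a different sample with the same maximum
for theta in [2.0, 2.3, 2.5, 3.0, 5.0]:
    print(f"theta = {theta:3.1f}   L(theta; x) = {likelihood(theta, x):.5f}   "
          f"L(theta; y) = {likelihood(theta, y):.5f}")
# Both likelihoods vanish for theta <= 2.3 and coincide exactly beyond it.
```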

This idea departs entirely from the mindset of simply collecting a lot of data, estimating the parameter, and plugging the approximation in elsewhere. What might look like mere intuitive guesswork is instead handled through mathematics and formal reasoning, and this is what lets us enter a deeper level of statistics.

Minimal Sufficient Statistic

In the Intuitive Example, we’ve intuitively confirmed that maxkXk\max_{k} X_{k} is a sufficient statistic for θ\theta. It seems there can’t be a better sufficient statistic than this, which leads us to the discussion on Minimal Sufficient Statistic.


  1. Hogg et al. (2013). Introduction to Mathematical Statistics (7th Edition): p391.

  2. Casella. (2001). Statistical Inference (2nd Edition): p272.