

Sufficient Statistic

Definitions

Mathematical Definition [1]

Let $f(x; \theta)$ be the probability mass/density function of a random sample $X_{1}, \cdots, X_{n}$ for the parameter $\theta \in \Theta$, and let $f_{Y_{1}}(y_{1}; \theta)$ be the probability mass/density function of the statistic $Y_{1} := u_{1}\left( X_{1}, \cdots, X_{n} \right)$.

If there exists a function $H\left( x_{1}, \cdots, x_{n} \right)$ that does not depend on $\theta \in \Theta$ such that
$$ \frac{f\left( x_{1}; \theta \right) \cdots f\left( x_{n}; \theta \right)}{f_{Y_{1}}\left( u_{1}\left( x_{1}, \cdots, x_{n} \right); \theta \right)} = H\left( x_{1}, \cdots, x_{n} \right) $$
then $Y_{1}$ is called a sufficient statistic for $\theta$.

General Definition [2]

A statistic $T(\mathbf{X})$ is called a sufficient statistic for the parameter $\theta$ if the conditional distribution of the sample $\mathbf{X}$ given the value of $T(\mathbf{X})$ does not depend on $\theta$.

Explanation

What the definitions mathematically imply is, intuitively, that $\theta$ cancels out of the numerator and the denominator: the sufficient statistic $Y_{1}$ captures exactly the information about $\theta$ contained in the random sample $X_{1}, \cdots, X_{n}$. The term 'sufficient' indicates that the statistic provides the information about $\theta$ 'sufficiently'; once the sufficient statistic is accounted for, no information about $\theta$ remains in the sample.
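To see this cancellation concretely, here is a minimal numerical sketch of Definition [1], borrowing the normal model that appears in the example below: with $\sigma$ known and $Y_{1} = \sum_{k} X_{k} \sim N(n\mu, n\sigma^{2})$, the ratio of the joint density to the density of $Y_{1}$ should come out the same for every $\mu$. The particular data, seed, and use of NumPy/SciPy are illustrative assumptions, not part of the original argument.

```python
# Minimal sketch: the ratio in Definition [1] does not depend on the parameter.
# Assumed model (for illustration): X_1, ..., X_n ~ N(mu, sigma^2), sigma known,
# so Y_1 = sum(X_k) ~ N(n*mu, n*sigma^2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma = 1.0
x = rng.normal(loc=3.0, scale=sigma, size=5)   # one fixed realization
n, t = len(x), x.sum()

for mu in [-2.0, 0.0, 1.5, 3.0, 10.0]:
    joint = norm.pdf(x, loc=mu, scale=sigma).prod()           # f(x_1; mu) ... f(x_n; mu)
    f_y1 = norm.pdf(t, loc=n * mu, scale=sigma * np.sqrt(n))  # f_{Y_1}(sum x_k; mu)
    print(f"mu = {mu:5.1f}   ratio = {joint / f_y1:.6e}")
# The printed ratio is identical for every mu: it is H(x_1, ..., x_n).
```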

To understand the concept of a sufficient statistic, let us refer to the following theorem.

Neyman Factorization Theorem: Let a random sample $X_{1}, \cdots, X_{n}$ have the same probability mass/density function $f(x; \theta)$ with respect to the parameter $\theta \in \Theta$. A statistic $Y_{1} = u_{1}\left( X_{1}, \cdots, X_{n} \right)$ is a sufficient statistic for $\theta$ if and only if there exist two non-negative functions $k_{1}, k_{2} \ge 0$ satisfying
$$ f\left( x_{1}; \theta \right) \cdots f\left( x_{n}; \theta \right) = k_{1}\left[ u_{1}\left( x_{1}, \cdots, x_{n} \right); \theta \right] k_{2}\left( x_{1}, \cdots, x_{n} \right) $$
Note that $k_{2}$ must not depend on $\theta$.

Unintuitive Example

$$ X_{1}, \cdots, X_{n} \sim N\left( \mu, \sigma^{2} \right) $$

Empirically, understanding sufficient statistics usually begins with grasping why such things are computed in the first place. A classic unintuitive example is the sufficient statistic for the population mean $\mu$ of a normal distribution $N\left( \mu, \sigma^{2} \right)$. By the factorization theorem,
$$ \begin{align*} \prod_{k=1}^{n} f\left( x_{k}; \mu \right) =& \prod_{k=1}^{n} \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{\left( x_{k} - \mu \right)^{2}}{2\sigma^{2}} \right) \\ =& \left( \frac{1}{\sigma \sqrt{2\pi}} \right)^{n} \exp\left( -\sum_{k=1}^{n} \frac{\left( x_{k} - \mu \right)^{2}}{2\sigma^{2}} \right) \\ =& \left( \frac{1}{\sigma \sqrt{2\pi}} \right)^{n} \exp\left( -\sum_{k=1}^{n} \frac{x_{k}^{2}}{2\sigma^{2}} \right) \exp\left( \frac{2\mu \sum_{k=1}^{n} x_{k} - n\mu^{2}}{2\sigma^{2}} \right) \\ =& k_{2}\left( x_{1}, \cdots, x_{n} \right) \cdot k_{1}\left[ u_{1}\left( x_{1}, \cdots, x_{n} \right); \mu \right] \end{align*} $$
so a sufficient statistic for $\mu$ is the sum of the sample $\sum_{k=1}^{n} X_{k}$, or equivalently the sample mean $\overline{X}$ obtained by dividing that sum by $n$. It is reassuring that, just as intuition suggests, the sample mean emerges, being an unbiased estimator, a consistent estimator, and the maximum likelihood estimator for $\mu$. Still, even though the mathematics checks out, it may feel quite abstract.
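The factorization can also be checked numerically: since the data enter $k_{1}$ only through $\sum_{k} x_{k}$, two different samples with the same sum must give likelihood curves that differ only by the constant factor $k_{2}(x)/k_{2}(y)$. The sketch below is only an illustration; the sample values and the use of SciPy are my own choices, not the text's.

```python
# Sketch: for N(mu, sigma^2) with sigma known, the likelihood depends on the
# data only through sum(x_k), so samples sharing that sum give a
# mu-independent likelihood ratio.
import numpy as np
from scipy.stats import norm

sigma = 1.0
x = np.array([0.2, 1.1, 2.7])   # illustrative data, sum = 4.0
y = np.array([1.0, 1.5, 1.5])   # different data, same sum = 4.0

for mu in [0.0, 0.5, 1.0, 2.0]:
    lx = norm.pdf(x, loc=mu, scale=sigma).prod()
    ly = norm.pdf(y, loc=mu, scale=sigma).prod()
    print(f"mu = {mu:3.1f}   L(mu; x) / L(mu; y) = {lx / ly:.6f}")
# The ratio is the same for every mu: it equals k_2(x) / k_2(y).
```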

Intuitive Example

$$ X_{1}, \cdots, X_{n} \sim U(0, \theta) \text{ with } f\left( x; \theta \right) = \begin{cases} \dfrac{1}{\theta} &, \text{if } x \in (0, \theta) \\ 0 &, \text{otherwise} \end{cases} = \frac{1}{\theta} I_{(0,\theta)}(x) $$

Consider, for example, a random sample from a uniform distribution $U(0, \theta)$ whose parameter $\theta$ is the maximum of its support. Suppose the actual realization is
$$ \begin{bmatrix} 2.3 \\ 1.2 \\ 1.7 \\ 0.1 \\ 0.1 \end{bmatrix} $$
and no more samples can be obtained. Since the true mean of a uniform distribution $U(a, b)$ is $\frac{b+a}{2}$, we could consider the following estimator:
$$ \frac{\hat{\theta} + 0}{2} = \frac{\sum_{k} x_{k}}{n} \implies \hat{\theta} \overset{?}{=} \frac{2 \sum_{k} x_{k}}{n} $$
This seems like a statistically reasonable guess. Indeed, doubling the calculated sample mean gives $2.16$, which looks plausible. However, since $2.3$ is in the sample, $\theta = 2.16$ is impossible. Whatever else is true, $\theta$ must be at least as large as $2.3$, and intuitively a rational estimate for $\theta$ is simply $\hat{\theta} = 2.3$; the current sample gives no reason to think it is any larger than $2.3$. Let us now actually find the sufficient statistic.
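For the record, the arithmetic above can be reproduced in a couple of lines (NumPy is used purely for convenience):

```python
import numpy as np

x = np.array([2.3, 1.2, 1.7, 0.1, 0.1])  # the realization quoted above
print(2 * x.mean())  # 2.16 -- the doubled sample mean
print(x.max())       # 2.3  -- already larger, so theta = 2.16 is impossible
```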

Product of Indicator Functions:
$$ \prod_{i=1}^{n} I_{(-\infty, \theta]}\left( x_{i} \right) = I_{(-\infty, \theta]}\left( \max_{i \in [n]} x_{i} \right) $$

Given this lemma and the factorization theorem,
$$ \begin{align*} \prod_{k=1}^{n} f\left( x_{k}; \theta \right) =& \prod_{k=1}^{n} \frac{1}{\theta} I_{(0,\theta)}\left( x_{k} \right) \\ =& \frac{1}{\theta^{n}} I_{(0,\theta)}\left( \max_{k} x_{k} \right) \cdot 1 \\ =& k_{1}\left[ u_{1}\left( x_{1}, \cdots, x_{n} \right); \theta \right] k_{2}\left( x_{1}, \cdots, x_{n} \right) \end{align*} $$
so the sample maximum $\max_{k} X_{k} = X_{(n)}$ is a sufficient statistic for $\theta$. This simply means that, as far as information about $\theta$ is concerned, the other observations are unnecessary; looking only at $\max_{k} X_{k}$ is 'sufficient'.
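As a numerical companion to this computation, the likelihood of a $U(0, \theta)$ sample is $\theta^{-n} I\left( \max_{k} x_{k} < \theta \right)$, so any two samples of the same size with the same maximum yield exactly the same likelihood function. The helper function and the second sample below are invented purely for illustration.

```python
# Sketch: the U(0, theta) likelihood depends on the data only through max(x).
import numpy as np

def likelihood(theta, x):
    """Joint density of an i.i.d. U(0, theta) sample evaluated at x."""
    x = np.asarray(x)
    if x.min() <= 0 or x.max() >= theta:
        return 0.0
    return theta ** (-len(x))

x = [2.3, 1.2, 1.7, 0.1, 0.1]   # the realization from the text
y = [2.3, 0.5, 0.5, 0.5, 0.5]   # a different sample with the same maximum
for theta in [2.0, 2.3, 2.5, 3.0, 5.0]:
    print(f"theta = {theta:3.1f}   L(theta; x) = {likelihood(theta, x):.5f}   "
          f"L(theta; y) = {likelihood(theta, y):.5f}")
# Both likelihoods vanish for theta <= 2.3 and coincide exactly beyond it.
```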

This idea departs entirely from the mindset of simply collecting a lot of data, estimating the parameter, and plugging the approximation in elsewhere. What might look like mere intuitive guesswork is instead handled through mathematics and formal reasoning, and this is what lets us enter a deeper level of statistics.

Minimal Sufficient Statistic

In the Intuitive Example, we’ve intuitively confirmed that maxkXk\max_{k} X_{k} is a sufficient statistic for θ\theta. It seems there can’t be a better sufficient statistic than this, which leads us to the discussion on Minimal Sufficient Statistic.


  1. Hogg et al. (2013). Introduction to Mathematical Statistics (7th Edition): p391.

  2. Casella. (2001). Statistical Inference (2nd Edition): p272.