Let the probability mass/density function of a random sample $X_1, \cdots, X_n$ with parameter $\theta \in \Theta$ be $f(x;\theta)$, and let the probability mass/density function of the statistic $Y_1 := u_1(X_1, \cdots, X_n)$ be $f_{Y_1}(y_1;\theta)$.
If there exists a function $H(x_1, \cdots, x_n)$ that does not depend on $\theta \in \Theta$ such that
$$
\frac{f(x_1;\theta) \cdots f(x_n;\theta)}{f_{Y_1}\left( u_1(x_1, \cdots, x_n); \theta \right)} = H(x_1, \cdots, x_n)
$$
then $Y_1$ is called a Sufficient Statistic for $\theta$.
Equivalently, a statistic $T(X)$ is called a Sufficient Statistic for the parameter $\theta$ if the conditional distribution of the sample $X$ given $T(X)$ does not depend on $\theta$.
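As a quick sanity check of the first definition, here is a minimal numerical sketch (not from the original text) that previews the normal example treated below: assuming $X_1, \cdots, X_n \sim N(\mu, \sigma^2)$ with $\sigma$ known and taking $Y_1 = \sum_k X_k \sim N(n\mu, n\sigma^2)$, the ratio in the definition should come out the same no matter which $\mu$ is plugged in. The sample values, the seed, and the use of numpy/scipy are illustrative assumptions.

```python
# Minimal sketch: for X_1, ..., X_n ~ N(mu, sigma^2) iid with sigma known,
# take Y_1 = sum(X_i), so that Y_1 ~ N(n*mu, n*sigma^2). The ratio of the
# joint density to the density of Y_1 should not depend on mu.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
sigma = 1.0
x = rng.normal(loc=0.7, scale=sigma, size=5)    # one fixed realization x_1, ..., x_n
n, s = len(x), x.sum()                          # s = u_1(x), the sample sum

for mu in [-1.0, 0.0, 0.7, 2.0]:
    joint = norm.pdf(x, loc=mu, scale=sigma).prod()           # f(x_1;mu) ... f(x_n;mu)
    f_y1 = norm.pdf(s, loc=n * mu, scale=sigma * np.sqrt(n))  # f_{Y_1}(u_1(x); mu)
    print(f"mu = {mu:+.1f},  joint / f_Y1 = {joint / f_y1:.6f}")
# The printed value H(x_1, ..., x_n) is identical for every mu.
```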
Explanation
Intuitively, what the definitions imply is that $\theta$ cancels out between the numerator and the denominator: the sufficient statistic $Y_1$ captures exactly the information about $\theta$ contained in the random sample $X_1, \cdots, X_n$. The term 'sufficient' indicates that the statistic provides information about $\theta$ 'sufficiently'; once the sufficient statistic has been accounted for, no information about $\theta$ should remain.
To understand the concept of a sufficient statistic, let us refer to the following theorem.
Neyman Factorization Theorem: Let a random sample $X_1, \cdots, X_n$ have the common probability mass/density function $f(x;\theta)$ with parameter $\theta \in \Theta$. A statistic $Y_1 = u_1(X_1, \cdots, X_n)$ is a Sufficient Statistic for $\theta$ if and only if there exist two non-negative functions $k_1, k_2 \ge 0$ satisfying:
$$
f(x_1;\theta) \cdots f(x_n;\theta) = k_1\left[ u_1(x_1, \cdots, x_n); \theta \right] k_2(x_1, \cdots, x_n)
$$
Note that $k_2$ must not depend on $\theta$.
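One practical consequence of this factorization can be seen numerically. The sketch below is my own illustration (not from the original text), using a hypothetical $N(\theta, 1)$ model with $u_1(x) = \sum_k x_k$ and numpy/scipy: whenever two samples share the same value of $u_1$, the $k_1$ factors cancel, so their likelihood ratio reduces to $k_2(x)/k_2(y)$ and cannot depend on $\theta$.

```python
# Two samples with the same value of u_1(x) = sum(x): by the factorization,
# the k_1 factors cancel in their likelihood ratio, which therefore cannot
# depend on theta. Model: N(theta, 1), chosen only for illustration.
import numpy as np
from scipy.stats import norm

x = np.array([1.0, 2.0, 3.0])   # u_1(x) = 6.0
y = np.array([0.5, 2.5, 3.0])   # u_1(y) = 6.0 as well

for theta in [-1.0, 0.0, 1.5, 4.0]:
    lx = norm.pdf(x, loc=theta, scale=1.0).prod()
    ly = norm.pdf(y, loc=theta, scale=1.0).prod()
    print(f"theta = {theta:+.1f},  L(theta; x) / L(theta; y) = {lx / ly:.6f}")
# The ratio equals k_2(x) / k_2(y) and is the same for every theta.
```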
Unintuitive Example
$$
X_1, \cdots, X_n \sim N(\mu, \sigma^2)
$$
Empirically, understanding sufficient statistics usually begins with grasping why such a quantity is computed in the first place. A classic, not-so-intuitive example is the sufficient statistic for the population mean $\mu$ of a normal distribution $N(\mu, \sigma^2)$, with $\sigma^2$ regarded as known. According to the factorization theorem,
$$
\begin{aligned}
\prod_{k=1}^{n} f(x_k;\mu) &= \prod_{k=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x_k-\mu)^2}{2\sigma^2} \right)
\\ &= \left( \frac{1}{\sigma\sqrt{2\pi}} \right)^{n} \exp\left( -\sum_{k=1}^{n} \frac{(x_k-\mu)^2}{2\sigma^2} \right)
\\ &= \left( \frac{1}{\sigma\sqrt{2\pi}} \right)^{n} \exp\left( -\sum_{k=1}^{n} \frac{x_k^2}{2\sigma^2} \right) \exp\left( \sum_{k=1}^{n} \frac{2x_k\mu - \mu^2}{2\sigma^2} \right)
\\ &= \left( \frac{1}{\sigma\sqrt{2\pi}} \right)^{n} \exp\left( -\sum_{k=1}^{n} \frac{x_k^2}{2\sigma^2} \right) \cdot \exp\left( \frac{\mu}{\sigma^2} \sum_{k=1}^{n} x_k - \frac{n}{2}\left( \frac{\mu}{\sigma} \right)^2 \right)
\\ &= k_2(x_1, \cdots, x_n) \cdot k_1\left[ u_1(x_1, \cdots, x_n); \mu \right]
\end{aligned}
$$
so the sufficient statistic for $\mu$ may be taken as either the sum of the samples $\sum_{k=1}^{n} X_k$ or the sample mean $\overline{X}$, which differs from that sum only by the factor $1/n$. It is reassuring that, in line with our intuition, the sample mean, which is an unbiased estimator, a consistent estimator, and the maximum likelihood estimator of $\mu$, emerges as sufficient. Even though the calculation makes sense mathematically, it may still feel quite abstract.
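The algebra above can be verified numerically. The following sketch is my own check (not part of the original derivation): it implements $k_1$ and $k_2$ exactly as they appear in the last two lines, with a hypothetical $\sigma$ and sample, and confirms that their product matches the joint density for any $\mu$.

```python
# Check that the product k_1 * k_2, with k_1 and k_2 read off from the last
# line of the derivation above, reproduces the joint normal density for any mu.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma = 2.0                                   # sigma^2 treated as known
x = rng.normal(loc=1.3, scale=sigma, size=8)  # a hypothetical realization
n, s = len(x), x.sum()                        # s = u_1(x) = sum of the sample

# k_2 depends only on the data; k_1 depends on the data only through s.
k2 = (1.0 / (sigma * np.sqrt(2 * np.pi))) ** n * np.exp(-np.sum(x**2) / (2 * sigma**2))

for mu in [-0.5, 0.0, 1.3, 3.0]:
    k1 = np.exp(mu / sigma**2 * s - 0.5 * n * (mu / sigma) ** 2)
    joint = norm.pdf(x, loc=mu, scale=sigma).prod()
    print(f"mu = {mu:+.1f},  k1*k2 = {k1 * k2:.6e},  joint = {joint:.6e}")
# The two columns agree for every mu, so sum(x) (equivalently x-bar) is sufficient.
```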
Intuitive Example
$$
X_1, \cdots, X_n \sim U(0,\theta) \quad \text{with} \quad f(x;\theta) = \begin{cases} \dfrac{1}{\theta} & \text{if } x \in (0,\theta) \\ 0 & \text{otherwise} \end{cases} = \frac{1}{\theta} I_{(0,\theta)}(x)
$$
Consider, for example, a random sample drawn from a uniform distribution $U(0,\theta)$ whose parameter $\theta$ is the upper endpoint. Suppose the actual realization is
$$
2.3 \qquad 1.2 \qquad 0.7 \qquad 0.1 \qquad 1.1
$$
and that no more samples can be obtained. Since the true mean of a uniform distribution $U(a,b)$ is $\frac{b+a}{2}$, we might consider the following estimator:
$$
\frac{\hat{\theta} + 0}{2} = \frac{\sum_{k} x_k}{n} \implies \hat{\theta} \overset{?}{=} \frac{2 \sum_{k} x_k}{n}
$$
Statistically, this does not seem like a bad guess. In fact, multiplying the computed sample mean by 2 gives 2.16, which looks plausible. However, since 2.3 appears in the sample, $\theta = 2.16$ is impossible: no matter what, $\theta$ must be at least as large as 2.3. Intuitively, a rational estimate is simply $\hat{\theta} = 2.3$, and based on the current sample there is no reason to believe $\theta$ is any larger.
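For completeness, here is the arithmetic above in code, using the illustrative realization listed earlier:

```python
# Redoing the arithmetic of the example above.
sample = [2.3, 1.2, 0.7, 0.1, 1.1]
print(round(2 * sum(sample) / len(sample), 2))  # 2 * sample mean = 2.16
print(max(sample))                              # sample maximum  = 2.3 > 2.16
```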
Let us now actually find the sufficient statistic. Using the fact that the product of indicator functions satisfies $\prod_{k=1}^{n} I_{(0,\theta)}(x_k) = I_{(0,\theta)}\left( \max_k x_k \right)$, together with the factorization theorem, the sufficient statistic for $\theta$ follows from
$$
\begin{aligned}
\prod_{k=1}^{n} f(x_k;\theta) &= \prod_{k=1}^{n} \frac{1}{\theta} I_{(0,\theta)}(x_k)
\\ &= \frac{1}{\theta^n} I_{(0,\theta)}\left( \max_k x_k \right) \cdot 1
\\ &= k_1\left[ u_1(x_1, \cdots, x_n); \theta \right] k_2(x_1, \cdots, x_n)
\end{aligned}
$$
thus the sample maximum $\max_k X_k = X_{(n)}$ is a sufficient statistic. This simply means that, as far as information about $\theta$ is concerned, the other observations are unnecessary: looking only at $\max_k X_k$ is 'sufficient'.
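To make this concrete, here is a small sketch of my own (not from the original text): since the likelihood of a $U(0,\theta)$ sample is $\theta^{-n}$ whenever $\theta$ exceeds the sample maximum and $0$ otherwise, any two samples of the same size with the same maximum produce exactly the same likelihood curve.

```python
# For U(0, theta), the likelihood of a sample is theta^(-n) when theta exceeds
# every observation and 0 otherwise, so it depends on the data only through
# the sample maximum (and the sample size).
import numpy as np

def uniform_likelihood(x, theta):
    x = np.asarray(x)
    if theta <= x.max():          # some observation would fall outside (0, theta)
        return 0.0
    return theta ** (-len(x))     # product of 1/theta over the n observations

x = [2.3, 1.2, 0.7, 0.1, 1.1]     # the realization above, maximum 2.3
y = [2.3, 0.4, 2.2, 1.9, 0.6]     # a hypothetical sample with the same maximum
for theta in [2.0, 2.3, 2.4, 3.0, 5.0]:
    print(theta, uniform_likelihood(x, theta), uniform_likelihood(y, theta))
# The two likelihood columns coincide: only max(x) matters for inference on theta.
```

Note also that $\theta^{-n}$ is decreasing in $\theta$, so the likelihood is largest as soon as $\theta$ reaches the sample maximum, which matches the intuitive estimate $\hat{\theta} = 2.3$.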
This way of thinking departs entirely from the mindset of simply gathering a lot of data, estimating the parameter, and plugging the approximation in elsewhere. It is statistical inference meeting what looks like mere intuitive speculation with mathematics and formality, and it lets us step into a deeper world of statistics.
Minimal Sufficient Statistic
In the intuitive example above, we confirmed that $\max_k X_k$ is a sufficient statistic for $\theta$. It seems there could hardly be a better sufficient statistic than this, which leads us to the discussion of the Minimal Sufficient Statistic.
Hogg et al. (2013). Introduction to Mathematical Statistics (7th Edition): p391. ↩︎