Sufficient Statistic
Definitions
Mathematical Definition 1
Let the probability mass/density function of a random sample $X_{1} , \cdots , X_{n}$ for parameter $\theta \in \Theta$ be $f(x;\theta)$, and let the probability mass/density function for statistic $Y_{1} := u_{1} \left( X_{1} , \cdots , X_{n} \right)$ be $f_{Y_{1}} \left( y_{1}; \theta \right)$.
If there exists a function $H \left( x_{1} , \cdots , x_{n} \right)$, not depending on $\theta \in \Theta$, such that $$ {{ f \left( x_{1} ; \theta \right) \cdots f \left( x_{n} ; \theta \right) } \over { f_{Y_{1}} \left( u_{1} \left( x_{1} , \cdots, x_{n} \right) ; \theta \right) }} = H \left( x_{1} , \cdots , x_{n} \right) $$ then $Y_{1}$ is called a Sufficient Statistic for $\theta$.
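As a quick illustration of this ratio, take, say, a Bernoulli$(p)$ sample, so that $Y_{1} = \sum_{k} X_{k}$ follows the binomial distribution $\text{Bin}(n,p)$. Then $$ {{ f \left( x_{1} ; p \right) \cdots f \left( x_{n} ; p \right) } \over { f_{Y_{1}} \left( \sum_{k} x_{k} ; p \right) }} = {{ p^{\sum_{k} x_{k}} (1-p)^{n - \sum_{k} x_{k}} } \over { \binom{n}{\sum_{k} x_{k}} p^{\sum_{k} x_{k}} (1-p)^{n - \sum_{k} x_{k}} }} = {{ 1 } \over { \binom{n}{\sum_{k} x_{k}} }} $$ which is free of $p$, so $\sum_{k} X_{k}$ is a sufficient statistic for $p$.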
General Definition 2
A statistic $T(\mathbf{X})$ is called a Sufficient Statistic for parameter $\theta$ if the conditional probability distribution of the sample $\mathbf{X}$, given the value of $T(\mathbf{X})$, does not depend on $\theta$.
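To see this definition in action, below is a minimal simulation sketch assuming a hypothetical Bernoulli$(p)$ sample of size $n = 3$ with $T(\mathbf{X}) = X_{1} + X_{2} + X_{3}$: conditioned on $T = 2$, the three possible samples $(1,1,0)$, $(1,0,1)$, $(0,1,1)$ occur with (approximately) equal frequency regardless of $p$.

```python
# Minimal simulation sketch: for a Bernoulli(p) sample of size n = 3 and
# T(X) = X1 + X2 + X3, the conditional distribution of the sample given
# T = 2 should be (roughly) uniform over its three arrangements for any p.
import numpy as np
from collections import Counter

def conditional_frequencies(p, n=3, t=2, reps=200_000, seed=0):
    rng = np.random.default_rng(seed)
    samples = rng.binomial(1, p, size=(reps, n))
    kept = samples[samples.sum(axis=1) == t]      # condition on T(X) = t
    counts = Counter(map(tuple, kept))
    return {pattern: round(c / len(kept), 3) for pattern, c in sorted(counts.items())}

print(conditional_frequencies(p=0.3))   # ~1/3 each for (0,1,1), (1,0,1), (1,1,0)
print(conditional_frequencies(p=0.7))   # same conditional frequencies, different p
```

Changing $p$ changes how often $T = 2$ occurs, but not the conditional distribution of the sample given $T = 2$, which is exactly what sufficiency demands.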
Explanation
What the definition mathematically implies is, intuitively, that $\theta$ cancels out between the numerator and the denominator: the sufficient statistic $Y_{1}$ captures exactly the information about $\theta$ contained in the random sample $X_{1} , \cdots , X_{n}$. The term ‘sufficient’ indicates that the statistic provides the information about $\theta$ ‘sufficiently’; once the sufficient statistic has been accounted for, no information about $\theta$ should remain.
To understand the concept of a sufficient statistic, let us refer to the following theorem.
Neyman Factorization Theorem: Let $X_{1} , \cdots , X_{n}$ be a random sample with common probability mass/density function $f \left( x ; \theta \right)$, $\theta \in \Theta$. A statistic $Y_{1} = u_{1} \left( X_{1} , \cdots , X_{n} \right)$ is a Sufficient Statistic for $\theta$ if and only if there exist two non-negative functions $k_{1} , k_{2} \ge 0$ satisfying $$ f \left( x_{1} ; \theta \right) \cdots f \left( x_{n} ; \theta \right) = k_{1} \left[ u_{1} \left( x_{1} , \cdots , x_{n} \right) ; \theta \right] k_{2} \left( x_{1} , \cdots , x_{n} \right) $$ where $k_{2}$ must not depend on $\theta$.
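For instance, taking (as a simple illustration) a Poisson$(\theta)$ sample with probability mass function $f \left( x ; \theta \right) = e^{-\theta} \theta^{x} / x!$, the factorization is immediate: $$ \prod_{k=1}^{n} {{ e^{-\theta} \theta^{x_{k}} } \over { x_{k}! }} = \left[ e^{-n\theta} \theta^{\sum_{k=1}^{n} x_{k}} \right] \cdot \left[ \prod_{k=1}^{n} {{ 1 } \over { x_{k}! }} \right] = k_{1} \left[ \sum_{k=1}^{n} x_{k} ; \theta \right] k_{2} \left( x_{1} , \cdots , x_{n} \right) $$ so $\sum_{k=1}^{n} X_{k}$ is a sufficient statistic for $\theta$.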
Unintuitive Example
$$ X_{1} , \cdots , X_{n} \sim N \left( \mu , \sigma^{2} \right) $$
Empirically, understanding sufficient statistics usually begins with grasping why such a thing is computed in the first place. A classic, not-so-intuitive example is the sufficient statistic for the population mean $\mu$ of a normal distribution $N \left( \mu , \sigma^{2} \right)$ with $\sigma^{2}$ known. By the factorization theorem, $$ \begin{align*} \prod_{k=1}^{n} f \left( x_{k} ; \mu \right) =& \prod_{k=1}^{n} {{ 1 } \over { \sigma \sqrt{2 \pi} }} \exp \left( - {{ \left( x_{k} - \mu \right)^{2} } \over { 2 \sigma^{2} }} \right) \\ =& \left( {{ 1 } \over { \sigma \sqrt{2 \pi} }} \right)^{n} \exp \left( - \sum_{k=1}^{n} {{ \left( x_{k} - \mu \right)^{2} } \over { 2 \sigma^{2} }} \right) \\ =& \left( {{ 1 } \over { \sigma \sqrt{2 \pi} }} \right)^{n} \exp \left( - \sum_{k=1}^{n} {{ x_{k}^{2} } \over { 2 \sigma^{2} }} \right) \cdot \exp \left( {{ \mu } \over { \sigma^{2} }} \sum_{k=1}^{n} x_{k} - {{ n \mu^{2} } \over { 2 \sigma^{2} }} \right) \\ =& k_{2} \left( x_{1} , \cdots , x_{n} \right) \cdot k_{1} \left[ u_{1} \left( x_{1} , \cdots , x_{n} \right) ; \mu \right] \end{align*} $$ so a sufficient statistic for $\mu$ is the sample sum $\sum_{k=1}^{n} X_{k}$, or equivalently the sample mean $\overline{X} = {{ 1 } \over { n }} \sum_{k=1}^{n} X_{k}$. It is reassuring that, in line with our intuition, the sample mean, which is an unbiased estimator, consistent estimator, and maximum likelihood estimator for $\mu$, emerges. Still, even though the mathematics checks out, this might feel quite abstract.
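The factorization shows that the data enter the $\mu$-dependent factor only through $\sum_{k} x_{k}$. The small numerical sketch below, using made-up sample values and $\sigma = 1$, checks this: two samples with the same sum produce likelihood functions of $\mu$ whose ratio is constant in $\mu$, i.e., they carry the same information about $\mu$.

```python
# Numerical sketch with made-up samples: for N(mu, sigma^2) with sigma known,
# the likelihood depends on the data only through sum(x), so two samples with
# the same sum give likelihoods of mu that differ only by a constant factor.
import numpy as np

sigma = 1.0
x_a = np.array([0.5, 1.5, 2.0])   # sum = 4.0
x_b = np.array([1.0, 1.0, 2.0])   # sum = 4.0 as well

def likelihood(x, mu):
    return np.prod(np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi)))

for mu in np.linspace(-2.0, 4.0, 7):
    print(f"mu = {mu:+.1f}   L_a / L_b = {likelihood(x_a, mu) / likelihood(x_b, mu):.6f}")
# the ratio is the same constant (about 0.7788) for every mu
```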
Intuitive Example
$$ X_{1} , \cdots , X_{n} \sim U (0,\theta) \text{ with } f \left( x ; \theta \right) = \begin{cases} {{ 1 } \over { \theta }} & , \text{if } x \in (0,\theta) \\ 0 & , \text{otherwise} \end{cases} = {{ 1 } \over { \theta }} I_{(0,\theta)} (x) $$
Consider, for example, a random sample from the uniform distribution $U(0,\theta)$ with unknown upper bound $\theta$. Suppose the actual realization is $$ \begin{bmatrix}2.3 \\ 1.2 \\ 1.7 \\ 0.1 \\ 1.1\end{bmatrix} $$ and no more samples can be obtained. Since the true mean of a uniform distribution $U(a,b)$ is ${{ b+a } \over { 2 }}$, so that $U(0,\theta)$ has mean $\theta / 2$, we could consider the following estimator: $$ {{ \hat{\theta} + 0 } \over { 2 }} = {{ \sum_{k} x_{k} } \over { n }} \implies \hat{\theta} \overset{?}{=} {{ 2 \sum_{k} x_{k} } \over { n }} $$ It seems like a statistically not-too-bad guess: doubling the calculated sample mean gives $2 \times 1.28 = 2.56$, which looks plausible. Notice, however, what this estimator ignores. Since $2.3$ appears in the sample, $\theta$ can never be smaller than $2.3$, yet with a slightly different realization nothing stops the doubled mean from falling below the observed maximum, which would be an outright impossible value for $\theta$. No matter what, $\theta$ must be at least $2.3$, and nothing in the current sample suggests it is any larger, so an intuitively rational estimate is simply $\hat{\theta} = 2.3$. Let us now actually find the sufficient statistic.
Product of Indicator Functions: $$ \prod_{i=1}^{n} I_{(-\infty, \theta]} \left( x_{i} \right) = I_{(-\infty, \theta]} \left( \max_{i \in [n]} x_{i} \right) $$
Given this lemma and the factorization theorem, $$ \begin{align*} \prod_{k=1}^{n} f \left( x_{k} ; \theta \right) =& \prod_{k=1}^{n} {{ 1 } \over { \theta }} I_{(0,\theta)} \left( x_{k} \right) \\ = & {{ 1 } \over { \theta^{n} }} I_{(0,\theta)} \left( \max_{k} x_{k} \right) \cdot \prod_{k=1}^{n} I_{(0,\infty)} \left( x_{k} \right) \\ = & k_{1} \left[ u_{1} \left( x_{1} , \cdots , x_{n} \right) ; \theta \right] k_{2} \left( x_{1} , \cdots , x_{n} \right) \end{align*} $$ so the sample maximum $\max_{k} X_{k} = X_{(n)}$ is a sufficient statistic for $\theta$; the lower-bound indicators do not involve $\theta$ and are absorbed into $k_{2}$. This simply means that, as far as information about $\theta$ is concerned, the other observations are unnecessary, and considering only $\max_{k} X_{k}$ is ‘sufficient’.
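A short simulation sketch, with arbitrarily chosen $\theta = 3$ and $n = 5$, makes the earlier intuition concrete: the doubled sample mean can drop below the observed maximum, a value $\theta$ cannot possibly take, whereas the estimate based on the sufficient statistic $X_{(n)}$ never does, and in this setting it is also the more accurate of the two in mean squared error.

```python
# Simulation sketch with arbitrary theta = 3, n = 5: compare the doubled
# sample mean against the sample maximum X_(n) as estimators of theta for
# U(0, theta).  The doubled mean can fall below the observed maximum, a
# value theta cannot possibly take; X_(n) never has that problem.
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 3.0, 5, 100_000
samples = rng.uniform(0, theta, size=(reps, n))

theta_mom = 2 * samples.mean(axis=1)   # doubled sample mean
theta_max = samples.max(axis=1)        # X_(n), based on the sufficient statistic

print(f"doubled mean below the sample maximum: {np.mean(theta_mom < theta_max):.1%} of runs")
print(f"RMSE of 2 * mean : {np.sqrt(np.mean((theta_mom - theta) ** 2)):.3f}")
print(f"RMSE of maximum  : {np.sqrt(np.mean((theta_max - theta) ** 2)):.3f}")
```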
This idea departs entirely from the mindset of simply gathering a lot of data, estimating parameters, and plugging the approximations in wherever they are needed. What might look like mere intuitive guesswork is instead handled with mathematics and formal argument, and it is precisely this that lets us enter a deeper world of statistics.
Minimal Sufficient Statistic
In the Intuitive Example, we confirmed intuitively that $\max_{k} X_{k}$ is a sufficient statistic for $\theta$. It seems there could hardly be a better sufficient statistic than this, and that thought leads naturally to the discussion of the Minimal Sufficient Statistic.