
Fisher Information

Buildup

Score Function

Consider a random variable $X$ whose probability density function is $f \left( x ; \theta \right)$ for a parameter $\theta \in \Theta$. The estimator that maximizes the log-likelihood function, the maximum likelihood estimator $\widehat{\theta}$, is found as a solution of the following equation. $$ \sum_{k=1}^{n} {{ \partial \log f \left( x_{k} ; \theta \right) } \over { \partial \theta }} = 0 $$ Here, $\displaystyle {{ \partial \log f ( x ; \theta ) } \over { \partial \theta }}$ is called the Score Function. Recalling what a partial derivative means, a score value close to $0$ indicates that a small change in $\theta$ has little effect on the likelihood. Conversely, computing $$ E \left[ {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right] $$ lets us quantify, on average, how much influence the given $\theta$ has. Working with the log-likelihood function $l$, which is built from sums, rather than the likelihood function $L$, which is built from products, serves the same purpose, since sums connect naturally to means, that is, to expected values. However, because the score function takes both positive and negative values, its plain average is a poor measure of 'the amount of change'. To overcome this, consider its square. $$ I(\theta) \overset{?}{=} E \left[ \left( {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right)^{2} \right] $$ In light of the discussion so far, $I ( \theta )$ can be read, as explained above, as how much information the given $\theta$ provides. Let us now see whether this form can be derived mathematically, independently of the context of maximum likelihood estimation.
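
Before moving to the derivation, here is a minimal numerical sketch of the candidate quantity, assuming a Bernoulli($\theta$) model (an illustrative choice, not fixed by the text above). It approximates the average score and $E \left[ \left( \partial \log f / \partial \theta \right)^{2} \right]$ by simulation; the latter can be compared with the known closed form $1 / \big( \theta (1 - \theta) \big)$.

```python
import numpy as np

# Monte Carlo sketch for a Bernoulli(theta) model (illustrative choice).
# score(x) = d/dtheta log f(x; theta) = x/theta - (1 - x)/(1 - theta)
rng = np.random.default_rng(0)
theta = 0.3
x = rng.binomial(1, theta, size=200_000)

score = x / theta - (1 - x) / (1 - theta)
print(score.mean())        # mixes positive and negative values, averages out near 0
print((score**2).mean())   # near 1 / (theta * (1 - theta)) ≈ 4.76
```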

Theorem

Bartlett’s Identity 1

Let us derive Bartlett's identities, named after the British statistician M. S. Bartlett.

Bartlett's Identities: Given that the regularity conditions (R0)–(R4) below are met,

  • [1] First identity: $$ E \left[ {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right] = 0 $$
  • [2] Second identity: $$ E \left[ {{ \partial^{2} \log f ( X ; \theta ) } \over { \partial \theta^{2} }} \right] + \operatorname{Var} \left( {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right) = 0 $$

Regularity conditions:

  • (R0): The probability density function $f$ is injective with respect to $\theta$; that is, distinct parameter values give distinct densities. $$ \theta \ne \theta' \implies f \left( x_{k} ; \theta \right) \ne f \left( x_{k} ; \theta' \right) $$
  • (R1): The probability density function $f$ has the same support for all $\theta$.
  • (R2): The true value $\theta_{0}$ is an interior point of the parameter space $\Theta$.
  • (R3): The probability density function $f$ is twice differentiable with respect to $\theta$.
  • (R4): The integral $\int f (x; \theta) dx$ can be differentiated twice with respect to $\theta$ under the integral sign.

Derivation

[1]

$$ 1 = \int_{-\infty}^{\infty} f (x ; \theta ) dx $$ By the regularity conditions (R3), (R4), differentiating both sides with respect to $\theta$ yields $$ 0 = \int_{-\infty}^{\infty} {{ \partial f ( x ; \theta ) } \over { \partial \theta }} dx $$ Using the trick of multiplying by $\displaystyle 1 = {{ f (x ; \theta ) } \over { f (x ; \theta ) }}$ inside the integral and viewing $f$ as a function of $\theta$, the derivative of the logarithm, $\displaystyle \left( \log g \right)' = {{ g' } \over { g }}$, gives $$ \begin{align*} 0 =& \int_{-\infty}^{\infty} {{ \partial f ( x ; \theta ) } \over { \partial \theta }} {{ f (x ; \theta ) } \over { f (x ; \theta ) }} dx \\ =& \int_{-\infty}^{\infty} {{ 1 } \over { f (x ; \theta ) }} {{ \partial f ( x ; \theta ) } \over { \partial \theta }} f (x ; \theta ) dx \\ =& \int_{-\infty}^{\infty} {{ \partial \log f ( x ; \theta ) } \over { \partial \theta }} f (x ; \theta ) dx \\ =& E \left[ {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right] \end{align*} $$
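
As a quick numerical sanity check of the first identity, here is a minimal sketch assuming a $N(\theta, 1)$ model, whose score is simply $x - \theta$; it evaluates $\int {{ \partial \log f } \over { \partial \theta }} f \, dx$ directly.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Numerical check of E[d log f / d theta] = 0 for a N(theta, 1) model,
# where d/dtheta log f(x; theta) = x - theta.
theta = 1.5
val, _ = quad(lambda x: (x - theta) * norm.pdf(x, loc=theta), -np.inf, np.inf)
print(val)   # ~ 0 up to numerical integration error
```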

[2]

$$ 0 = \int_{-\infty}^{\infty} {{ \partial \log f ( x ; \theta ) } \over { \partial \theta }} f (x ; \theta ) dx $$

This equation was obtained in the derivation of [1]. Differentiating it once more with respect to $\theta$, applying the product rule together with the same trick as before,

$$ \begin{align*} 0 =& {{ \partial } \over { \partial \theta }} 0 \\ =& \int_{-\infty}^{\infty} {{ \partial } \over { \partial \theta }} \left( {{ \partial \log f ( x ; \theta ) } \over { \partial \theta }} f (x ; \theta ) \right) dx \\ =& \int_{-\infty}^{\infty} {{ \partial^{2} \log f ( x ; \theta ) } \over { \partial \theta^{2} }} f (x ; \theta ) dx + \int_{-\infty}^{\infty} {{ \partial \log f ( x ; \theta ) } \over { \partial \theta }} {{ \partial f ( x ; \theta ) } \over { \partial \theta }} dx \\ =& \int_{-\infty}^{\infty} {{ \partial^{2} \log f ( x ; \theta ) } \over { \partial \theta^{2} }} f (x ; \theta ) dx + \int_{-\infty}^{\infty} \left( {{ \partial \log f ( x ; \theta ) } \over { \partial \theta }} \right)^{2} f (x ; \theta ) dx \\ =& E \left[ {{ \partial^{2} \log f ( X ; \theta ) } \over { \partial \theta^{2} }} \right] + E \left[ \left( {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right)^{2} \right] \end{align*} $$ Here the fourth line uses $\displaystyle {{ \partial f ( x ; \theta ) } \over { \partial \theta }} = {{ \partial \log f ( x ; \theta ) } \over { \partial \theta }} f ( x ; \theta )$, exactly as in the derivation of [1]. By identity [1], $\displaystyle E \left[ {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right] = 0$, so subtracting $0^{2} = E \left[ {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right]^{2}$ from both sides and using $\operatorname{Var} Y = E Y^{2} - \left( EY \right)^{2}$ gives $$ \begin{align*} 0 - 0^{2} =& E \left[ {{ \partial^{2} \log f ( X ; \theta ) } \over { \partial \theta^{2} }} \right] + E \left[ \left( {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right)^{2} \right] - E \left[ {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right]^{2} \\ =& E \left[ {{ \partial^{2} \log f ( X ; \theta ) } \over { \partial \theta^{2} }} \right] + \operatorname{Var} \left( {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right) \end{align*} $$ which is precisely the second identity.
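
The second identity can be checked the same way. The following sketch sums exactly over the support of a Bernoulli($\theta$) model (again just an illustrative choice) and confirms that $E \left[ \partial^{2} \log f / \partial \theta^{2} \right] + \operatorname{Var} \left( \partial \log f / \partial \theta \right)$ comes out as $0$.

```python
# Exact check of the second identity for a Bernoulli(theta) model:
# log f(x; theta) = x log(theta) + (1 - x) log(1 - theta), x in {0, 1}.
theta = 0.3
pmf = {0: 1 - theta, 1: theta}

score = {x: x / theta - (1 - x) / (1 - theta) for x in pmf}            # d log f / d theta
hess  = {x: -x / theta**2 - (1 - x) / (1 - theta)**2 for x in pmf}     # d^2 log f / d theta^2

e_hess    = sum(hess[x] * pmf[x] for x in pmf)
e_score   = sum(score[x] * pmf[x] for x in pmf)
var_score = sum((score[x] - e_score) ** 2 * pmf[x] for x in pmf)
print(e_hess + var_score)   # 0.0 up to floating-point error
```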

Definition 2

The Fisher information $I(\theta)$ is defined as the expected value of the square of the score function. $$ I(\theta) := E \left[ \left( {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right)^{2} \right] $$
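
For example, for a Bernoulli($\theta$) distribution with $f ( x ; \theta ) = \theta^{x} (1 - \theta)^{1 - x}$, a standard worked instance of this definition gives $$ {{ \partial \log f ( x ; \theta ) } \over { \partial \theta }} = {{ x } \over { \theta }} - {{ 1 - x } \over { 1 - \theta }} \implies I (\theta) = E \left[ \left( {{ X } \over { \theta }} - {{ 1 - X } \over { 1 - \theta }} \right)^{2} \right] = {{ 1 } \over { \theta ( 1 - \theta ) }} $$ so the information is smallest at $\theta = 1/2$ and grows without bound as $\theta$ approaches $0$ or $1$.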

Corollary

According to Bartlett's identities, the following holds: $$ \begin{align*} I ( \theta ) =& E \left[ \left( {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right)^{2} \right] - 0^{2} \\ =& \operatorname{Var} \left( {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right) \\ =& - E \left[ {{ \partial^{2} \log f ( X ; \theta ) } \over { \partial \theta^{2} }} \right] \end{align*} $$ Accordingly, since the observations in a random sample $X_{1}, \cdots , X_{n}$ are independent, the information of the sample is $$ \operatorname{Var} \left( {{ \partial \log L ( \theta ; \mathbf{X} ) } \over { \partial \theta }} \right) = \operatorname{Var} \left( \sum_{k=1}^{n} {{ \partial \log f ( X_{k} ; \theta ) } \over { \partial \theta }} \right) = n I (\theta) $$ In other words, the information grows in proportion to the number of samples.
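
A simulation can illustrate the $n I(\theta)$ claim. The sketch below (Bernoulli($\theta$) once more, with sample size and replication count chosen arbitrarily) estimates the variance of the total score over many replicated samples and compares it with $n / \big( \theta (1 - \theta) \big)$.

```python
import numpy as np

# Monte Carlo sketch: the variance of the total score of a random sample of size n
# is approximately n * I(theta) = n / (theta * (1 - theta)) for the Bernoulli model.
rng = np.random.default_rng(1)
theta, n, reps = 0.3, 50, 100_000

x = rng.binomial(1, theta, size=(reps, n))
total_score = (x / theta - (1 - x) / (1 - theta)).sum(axis=1)
print(total_score.var())       # ~ 50 / 0.21 ≈ 238.1
```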

Explanation

If the Buildup above is convincing, there should be little question about why Fisher information is defined in this seemingly complicated way.

The corollary shows that Fisher information can also be written as the variance of the score function: a larger Fisher information means the score function varies more widely with the data, which connects naturally to the intuitive notion of 'amount of information'.

Also See


  1. https://math.stackexchange.com/a/2027660/459895 ↩︎

  2. Hogg et al. (2013). Introduction to Mathematical Statistics (7th Edition): p334. ↩︎