

Fisher Information

Buildup

Score Function

Consider a random variable $X$ whose probability density function is $f(x ; \theta)$ for a parameter $\theta \in \Theta$. The maximum likelihood estimator, that is, the estimator $\widehat{\theta}$ that maximizes the log-likelihood function, is found as the solution of the following equation.
$$ \sum_{k=1}^{n} {{ \partial \log f \left( x_{k} ; \theta \right) } \over { \partial \theta }} = 0 $$
Here, $\displaystyle {{ \partial \log f ( x ; \theta ) } \over { \partial \theta }}$ is called the score function. Recalling the mathematical meaning of a partial derivative, a value close to $0$ means that a change in $\theta$ has little effect on the likelihood function. Conversely, computing
$$ E \left[ {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right] $$
lets us quantify the influence of the given $\theta$. Working with the log-likelihood function $l$, which is built by addition rather than the multiplication used in the likelihood function $L$, also serves this purpose, since sums connect naturally to the concept of means, that is, expected values. However, since the score function takes both positive and negative values, it is not suitable for measuring 'the amount of change'. To overcome this, consider squaring it.
$$ I(\theta) \overset{?}{=} E \left[ \left( {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right)^{2} \right] $$
Given the discussion so far, $I(\theta)$ can be read, as explained above, as how much information the given $\theta$ provides. Now let us see whether this form can be derived mathematically, independently of the context of maximum likelihood estimation.
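Before moving on, here is a minimal numerical sketch, not part of the original exposition, of how the score equation produces the maximum likelihood estimator. It assumes an exponential density $f(x ; \theta) = \theta e^{-\theta x}$, whose score is $1/\theta - x$ and whose MLE has the closed form $1/\bar{x}$; the sample size, seed, and true parameter are arbitrary choices.

```python
import numpy as np
from scipy.optimize import brentq

# Assumed example: f(x; theta) = theta * exp(-theta * x), an exponential
# density, so the score function is d/dtheta log f = 1/theta - x.
rng = np.random.default_rng(0)
true_theta = 2.0
x = rng.exponential(scale=1 / true_theta, size=500)

def total_score(theta):
    """Sum of the score function over the whole sample."""
    return np.sum(1 / theta - x)

# The MLE solves the score equation  sum_k d/dtheta log f(x_k; theta) = 0.
theta_hat = brentq(total_score, 1e-6, 100.0)

print(theta_hat, 1 / x.mean())  # both equal the exponential MLE 1 / xbar
```

Solving the score equation numerically and evaluating the closed form $1/\bar{x}$ give the same number, which is the point of the sketch.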

Theorem

Bartlett’s Identity 1

Let's derive Bartlett's identity, named after the British statistician Bartlett.

Bartlett’s Identity: Given that the regularity conditions (R0)~(R4) are met,

  • [1] First identity: $E \left[ {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right] = 0$
  • [2] Second identity: $E \left[ {{ \partial^{2} \log f ( X ; \theta ) } \over { \partial \theta^{2} }} \right] + \Var \left( {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right) = 0$

Regularity conditions:

  • (R0): The probability density function $f$ is injective with respect to $\theta$. Formally, it satisfies the following: $\theta \ne \theta ' \implies f \left( x_{k} ; \theta \right) \ne f \left( x_{k} ; \theta ' \right)$
  • (R1): The probability density function $f$ has the same support for all $\theta$.
  • (R2): The true value $\theta_{0}$ is an interior point of $\Omega$.
  • (R3): The probability density function $f$ is twice differentiable with respect to $\theta$.
  • (R4): The integral $\int f (x; \theta) dx$ can be differentiated twice under the integral sign with respect to $\theta$.

Derivation

[1]

$$ 1 = \int_{-\infty}^{\infty} f (x ; \theta ) dx $$
By the regularity conditions (R3), (R4), differentiating both sides with respect to $\theta$ yields
$$ 0 = \int_{-\infty}^{\infty} {{ \partial f ( x ; \theta ) } \over { \partial \theta }} dx $$
Using the trick of multiplying by $\displaystyle 1 = {{ f (x ; \theta ) } \over { f (x ; \theta ) }}$ inside the integral, viewing $f$ as a function of $\theta$ so that the derivative of the logarithm gives $\displaystyle \left( \log g \right)' = {{ g' } \over { g }}$, and writing $f_{\theta} := {{ \partial f } \over { \partial \theta }}$ for brevity,
$$ \begin{align*} 0 =& \int_{-\infty}^{\infty} {{ \partial f ( x ; \theta ) } \over { \partial \theta }} {{ f (x ; \theta ) } \over { f (x ; \theta ) }} dx \\ =& \int_{-\infty}^{\infty} {{ f_{\theta} (x ; \theta) } \over { f (x ; \theta ) }} f (x ; \theta ) dx \\ =& \int_{-\infty}^{\infty} {{ \partial \log f ( x ; \theta ) } \over { \partial \theta }} f (x ; \theta ) dx \\ =& E \left[ {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right] \end{align*} $$
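As a quick sanity check that is not in the original derivation, the first identity can be verified symbolically for one concrete density. The sketch below assumes the exponential density $f(x ; \theta) = \theta e^{-\theta x}$ supported on $(0, \infty)$ and confirms that the expected score is $0$.

```python
import sympy as sp

# Assumed concrete density for illustration: f(x; theta) = theta * exp(-theta * x).
x, theta = sp.symbols('x theta', positive=True)
f = theta * sp.exp(-theta * x)

score = sp.diff(sp.log(f), theta)                 # d/dtheta log f = 1/theta - x
first_identity = sp.integrate(score * f, (x, 0, sp.oo))

print(sp.simplify(first_identity))                # prints 0, i.e. E[score] = 0
```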

[2]

$$ 0 = \int_{-\infty}^{\infty} {{ \partial \log f ( x ; \theta ) } \over { \partial \theta }} f (x ; \theta ) dx $$

This equation was obtained during the derivation of [1]. Differentiating it once more with respect to $\theta$, applying the product rule together with the same trick as before,

$$ \begin{align*} 0 =& {{ \partial } \over { \partial \theta }} 0 \\ =& \int_{-\infty}^{\infty} {{ \partial } \over { \partial \theta }} \left( {{ \partial \log f ( x ; \theta ) } \over { \partial \theta }} f (x ; \theta ) \right) dx \\ =& \int_{-\infty}^{\infty} {{ \partial^{2} \log f ( x ; \theta ) } \over { \partial \theta^{2} }} f (x ; \theta ) dx + \int_{-\infty}^{\infty} \left( {{ \partial \log f ( x ; \theta ) } \over { \partial \theta }} \right)^{2} f (x ; \theta ) dx \\ =& E \left[ {{ \partial^{2} \log f ( X ; \theta ) } \over { \partial \theta^{2} }} \right] + E \left[ \left( {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right)^{2} \right] \end{align*} $$
Here the cross term uses $\displaystyle {{ \partial f ( x ; \theta ) } \over { \partial \theta }} = {{ \partial \log f ( x ; \theta ) } \over { \partial \theta }} f (x ; \theta )$ once more. By identity [1], $\displaystyle 0^{2} = E \left[ {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right]^{2}$, so subtracting it from both sides and using $\Var Y = E Y^{2} - \left( EY \right)^{2}$ gives
$$ \begin{align*} 0 - 0^{2} =& E \left[ {{ \partial^{2} \log f ( X ; \theta ) } \over { \partial \theta^{2} }} \right] + E \left[ \left( {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right)^{2} \right] - E \left[ {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right]^{2} \\ =& E \left[ {{ \partial^{2} \log f ( X ; \theta ) } \over { \partial \theta^{2} }} \right] + \Var \left( {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right) \end{align*} $$
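The second identity can be checked the same way. The following sketch, again only an illustration with the exponential density assumed above, confirms that $E \left[ \partial^{2} \log f / \partial \theta^{2} \right]$ and $\Var \left( \partial \log f / \partial \theta \right)$ cancel.

```python
import sympy as sp

# Same assumed density as before: f(x; theta) = theta * exp(-theta * x).
x, theta = sp.symbols('x theta', positive=True)
f = theta * sp.exp(-theta * x)
score = sp.diff(sp.log(f), theta)

# E[d^2/dtheta^2 log f]  and  Var(score) = E[score^2] - (E[score])^2
e_second = sp.integrate(sp.diff(score, theta) * f, (x, 0, sp.oo))
e_score = sp.integrate(score * f, (x, 0, sp.oo))
var_score = sp.integrate(score**2 * f, (x, 0, sp.oo)) - e_score**2

print(sp.simplify(e_second + var_score))          # prints 0: the two terms cancel
```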

Definition 2

The Fisher information $I(\theta)$ is defined as the expected value of the square of the score function:
$$ I(\theta) := E \left[ \left( {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right)^{2} \right] $$
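For a concrete example of the definition, the sketch below, an illustration rather than anything from the original text, computes $I(p)$ for an assumed Bernoulli pmf $f(x ; p) = p^{x} (1-p)^{1-x}$, where the expectation is a sum over the support $\{0, 1\}$ instead of an integral.

```python
import sympy as sp

# Assumed Bernoulli pmf for illustration: f(x; p) = p**x * (1 - p)**(1 - x).
x = sp.Symbol('x')
p = sp.Symbol('p', positive=True)
f = p**x * (1 - p)**(1 - x)
score = sp.diff(sp.log(f), p)

# Fisher information from the definition: E[score^2], summed over {0, 1}
info = sum(score.subs(x, k)**2 * f.subs(x, k) for k in (0, 1))

print(sp.simplify(info))   # equals 1/(p*(1 - p)), the well-known Bernoulli Fisher information
```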

Corollary

According to Bartlett's identities, the following holds:
$$ \begin{align*} I ( \theta ) =& E \left[ \left( {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right)^{2} \right] - 0^{2} \\ =& \Var \left( {{ \partial \log f ( X ; \theta ) } \over { \partial \theta }} \right) \\ =& - E \left[ {{ \partial^{2} \log f ( X ; \theta ) } \over { \partial \theta^{2} }} \right] \end{align*} $$
Accordingly, for a random sample $X_{1}, \cdots , X_{n}$ the information is
$$ \Var \left( {{ \partial \log L ( \theta ; \mathbf{X} ) } \over { \partial \theta }} \right) = \Var \left( \sum_{k=1}^{n} {{ \partial \log f ( X_{k} ; \theta ) } \over { \partial \theta }} \right) = n I (\theta) $$
In other words, the information grows in proportion to the number of samples.
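To see the $n I(\theta)$ scaling numerically, here is a rough Monte Carlo sketch under assumptions chosen purely for illustration: a $N(\mu, \sigma^{2})$ sample with $\sigma$ known, for which the score for $\mu$ is $\sum_{k} (X_{k} - \mu)/\sigma^{2}$ and $I(\mu) = 1/\sigma^{2}$, so the variance of the sample score should come out near $n/\sigma^{2}$.

```python
import numpy as np

# Assumed model for illustration: X_k ~ N(mu, sigma^2) with sigma known,
# so I(mu) = 1/sigma^2 and Var(sample score) should be about n * I(mu).
rng = np.random.default_rng(0)
mu, sigma, n, reps = 1.0, 2.0, 50, 20000

samples = rng.normal(mu, sigma, size=(reps, n))
sample_scores = ((samples - mu) / sigma**2).sum(axis=1)  # score of each simulated sample

print(sample_scores.var())   # close to n / sigma**2 = 12.5
```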

Explanation

If the Buildup above is convincing, there should be little question as to why Fisher information is defined in this seemingly roundabout way.

In the corollary, it was observed that Fisher information can also be expressed as the variance of the score function: a larger Fisher information means the score function varies more widely, which connects naturally to the intuitive notion of 'amount of information'.

See Also


  1. https://math.stackexchange.com/a/2027660/459895 ↩︎

  2. Hogg et al. (2013). Introduction to Mathematical Statistics (7th Edition): p334. ↩︎