
Energy-Based Model

Overview

Energy-based models are a class of methods that define a function of the data, called the energy, under the convention that data with lower energy is more plausible (i.e., more probable). Rather than assuming a previously known distribution, an energy-based model represents the data distribution directly, which makes it useful for handling complex datasets.

Build-up

In traditional probabilistic models, the approach to handling data involves assuming that the data follows a well-known distribution (e.g., normal distribution, Bernoulli distribution, Poisson distribution, etc.), and estimating the parameters of that distribution. For instance, if data is assumed to follow a normal distribution, one can model the entire distribution by estimating the mean and variance. This approach is well-established in statistics, allows for mathematical interpretation, and works effectively for simple data.
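As a minimal sketch of this classical workflow (assuming NumPy; the data is synthetic and the normality assumption is imposed by hand):

```python
import numpy as np

# Synthetic one-dimensional data, assumed to follow a normal distribution.
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)

# Maximum likelihood estimates of the two parameters.
mu_hat = data.mean()
sigma_hat = data.std()  # MLE variant: divides by N, not N - 1

# The entire distribution is now modeled by (mu_hat, sigma_hat).
def density(x):
    return np.exp(-((x - mu_hat) ** 2) / (2 * sigma_hat**2)) / (
        np.sqrt(2 * np.pi) * sigma_hat
    )

print(mu_hat, sigma_hat)  # ≈ 5.0 and ≈ 2.0
```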

However, this approach has clear limitations. In reality, the data we encounter is high-dimensional, complex, and challenging to represent in a simple form like a normal distribution. Furthermore, it is often difficult to clearly know the shape of the distribution for complexly structured data, making the assumption of a “normal distribution” unrealistic. Approaches that pre-assume a distribution and attempt to fit the data to it run the risk of not reflecting the data’s intrinsic structure.

For example, consider a photograph of a person’s face. A facial image comprises numerous pixels, each represented as an integer from $0$ to $255$. The colors of these pixels are not independent; they exhibit particular patterns. For instance, the positions and shapes of the eyes, nose, and mouth are interrelated. Such data is difficult to express with simple probability density functions like those above. Assuming a specific probability distribution and estimating its parameters amounts to imposing a strong hypothesis that the data follows that distribution, even when the fit is poor. When dealing with complex data, it is therefore often better to model the data’s distribution directly rather than to assume one. The energy-based model is an example of such a method.

The idea behind energy-based models, as the name hints, comes from physics. In a system at temperature $T$, the probability of a state with energy $E$ is given by the following, known as the Boltzmann distribution.

$$ P(E) \propto e^{ - \frac{E}{k_{B} T} } $$

Here, $k_{B}$ is the Boltzmann constant. What is essential is not the details about “systems” or “temperatures” but the fact that states with lower energy have higher probability. In physical terms, stability means low energy. Applied to data, this says that data with lower energy is more likely to be sampled, i.e., that it is plausible data; a quick numerical check of this principle follows below.
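As a minimal illustration of the Boltzmann weighting (a sketch assuming NumPy; the four energy values are arbitrary and $k_{B} T$ is set to $1$):

```python
import numpy as np

# Arbitrary energies of four hypothetical states, in units of k_B * T.
E = np.array([0.0, 1.0, 2.0, 5.0])

# Boltzmann weights and their normalization (the partition function).
weights = np.exp(-E)
P = weights / weights.sum()

for e, p in zip(E, P):
    print(f"E = {e:.1f}  ->  P = {p:.3f}")
# Lower energy -> higher probability: the E = 0.0 state is most likely.
```

Thus, energy-based models define the probability density function of data as follows.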

Definition

An energy-based model assumes that the probability density function of the data distribution has the following form.

$$ p_{\boldsymbol{\theta}}(\mathbf{x}) = \dfrac{1}{Z(\boldsymbol{\theta})} e^{ - E_{\boldsymbol{\theta}}(\mathbf{x}) }, \qquad Z(\boldsymbol{\theta}) = \int e^{ - E_{\boldsymbol{\theta}}(\mathbf{x}) } d\mathbf{x} $$

Here, $E_{\boldsymbol{\theta}}$ is known as the energy function. The denominator $Z(\boldsymbol{\theta}) = \int e^{ - E_{\boldsymbol{\theta}}(\mathbf{x}) } d\mathbf{x}$ is the normalizing constant.
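As a concrete sanity check of the definition (a sketch assuming NumPy; the quadratic energy is a toy choice, not part of the definition): the energy $E_{\boldsymbol{\theta}}(x) = x^{2}/2$ yields the standard normal density, whose partition function is $Z = \sqrt{2\pi}$, and a numerical integral recovers this value.

```python
import numpy as np

def energy(x):
    # Toy energy: E(x) = x^2 / 2, so p(x) ∝ exp(-x^2 / 2) (standard normal).
    return x**2 / 2

# Approximate Z = ∫ exp(-E(x)) dx with a Riemann sum on a wide grid.
x = np.linspace(-10.0, 10.0, 200_001)
dx = x[1] - x[0]
Z = np.exp(-energy(x)).sum() * dx

print(Z, np.sqrt(2 * np.pi))  # both ≈ 2.5066
```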

Explanation

$p_{\boldsymbol{\theta}}$ must integrate to $1$ to be a valid probability density function, hence the exponential is divided by the normalization constant $Z(\boldsymbol{\theta})$. The quantity $Z(\boldsymbol{\theta}) = \int e^{ - E_{\boldsymbol{\theta}}(\mathbf{x}) } d\mathbf{x}$ is also referred to as the partition function. The exponential factor may be abbreviated as follows.

$$ p_{\boldsymbol{\theta}}(\mathbf{x}) = \dfrac{1}{Z(\boldsymbol{\theta})} \tilde{p}(\mathbf{x}; \boldsymbol{\theta}) = \dfrac{1}{Z(\boldsymbol{\theta})} f(\mathbf{x}; \boldsymbol{\theta}) $$

Because $f$ is not normalized, such a model is called an unnormalized model.

The expressiveness of $p_{\boldsymbol{\theta}}$ depends on how the energy function is defined, which allows even complex data to be modeled effectively. However complicated the energy function may be, the density ultimately takes the form of an exponential, which fits naturally into the language of information theory. Since information is defined as the negative log of probability, we obtain the following.

$$ I(\mathbf{x}) = - \log p_{\boldsymbol{\theta}}(\mathbf{x}) = E_{\boldsymbol{\theta}}(\mathbf{x}) + \log Z(\boldsymbol{\theta}) $$

The second term is constant with respect to $\mathbf{x}$, so in an energy-based model the energy of data $\mathbf{x}$ equals its information up to an additive constant.
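Continuing the toy quadratic energy from the sketch above (assuming NumPy; $Z = \sqrt{2\pi}$ is known in closed form for this energy), the identity can be checked pointwise:

```python
import numpy as np

def energy(x):
    return x**2 / 2  # toy energy: standard normal up to normalization

Z = np.sqrt(2 * np.pi)  # closed-form partition function for this energy

x0 = 1.7
information = -np.log(np.exp(-energy(x0)) / Z)  # I(x) = -log p(x)
print(information, energy(x0) + np.log(Z))      # the two values agree
```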

  • Model
    • Hopfield Network
    • Boltzmann Machine
      • Restricted Boltzmann Machine
  • Learning, Sampling: When computing the normalization constant $Z(\boldsymbol{\theta})$ is difficult due to issues such as the curse of dimensionality, sampling methods are employed instead; one example is sketched below.
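As one illustration (a sketch assuming NumPy; Langevin dynamics with the toy quadratic energy, named here as an example rather than taken from the text above), such samplers use only the gradient of the energy, so $Z(\boldsymbol{\theta})$ never has to be evaluated:

```python
import numpy as np

def grad_energy(x):
    # ∇E for the toy energy E(x) = x^2 / 2; in practice the gradient of a
    # learned (e.g., neural-network) energy would be substituted here.
    return x

rng = np.random.default_rng(0)
step = 0.01
x = 0.0
samples = []

for t in range(60_000):
    # Unadjusted Langevin update: x ← x - (step/2)·∇E(x) + sqrt(step)·ξ
    x = x - 0.5 * step * grad_energy(x) + np.sqrt(step) * rng.normal()
    if t >= 10_000:  # discard burn-in, keep the (correlated) tail samples
        samples.append(x)

samples = np.array(samples)
print(samples.mean(), samples.std())  # ≈ 0 and ≈ 1 for this energy
```

Because the update needs only $\nabla E$, the same loop applies unchanged however complex the energy function is.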