Reservoir Computing
Definition 1
In the context of time series forecasting, suppose that at time $t$ we are given an $M$-dimensional input vector $\mathbf{u} (t) \in \mathbb{R}^{M}$ and wish to predict a $P$-dimensional output vector $\mathbf{s} (t) \in \mathbb{R}^{P}$. To this end, introduce the reservoir state $\mathbf{r} (t) \in \mathbb{R}^{N}$ and predict the output vector by a linear transformation of it; this method is called reservoir computing.

Assume that the data on $0 < t \leq T$ are sampled at uniform time intervals $\Delta t$, yielding $K$ data points $\left\{ \left( \mathbf{u} \left( k \Delta t \right) , \mathbf{s} \left( k \Delta t \right) \right) \right\}_{k=1}^{K}$. The predicted output vector $\hat{ \mathbf{s} }$ beyond $T$ is computed as follows. $$ \hat{ \mathbf{s} } (t + \Delta t) = W_{\text{out}} \left[ \left( 1 - \alpha \right) \mathbf{r} (t) + \alpha \tanh \left( A \mathbf{r} (t) + W_{\text{in}} \mathbf{u} (t) \right) \right] $$ Here, all quantities other than $W_{\text{out}}$ are treated as hyperparameters:
- $\alpha \in [0, 1]$: leakage rate
- $A \in \mathbb{R}^{N \times N}$: adjacency matrix of the reservoir layer
- $W_{\text{in}} \in \mathbb{R}^{N \times M}$: input weight matrix
- $\tanh : \mathbb{R}^{N} \to \mathbb{R}^{N}$: vectorized hyperbolic tangent
Explanation
Reservoir computing is a machine learning technique widely used in the study of dynamical systems. It is useful when past data are available but future observations will be limited.
From the algebraic form of $\hat{ \mathbf{s} }$, given the input vector $\mathbf{u} (t)$ at time $t$, one can predict the output vector $\hat{ \mathbf{s} } (t + \Delta t)$ at time $t + \Delta t$ via the previously learned $W_{\text{out}}$. The main objective of reservoir computing can be summarized as finding an approximation $f$ of the function that maps $\mathbf{u}$ to $\mathbf{s}$. $$ \mathbf{s} \left( t + \Delta t \right) = f \left( \mathbf{u} (t) \right) $$

Rössler attractor:
$$ \begin{align*} \frac{dx}{dt} =& - y - z \\ \frac{dy}{dt} =& x + ay \\ \frac{dz}{dt} =& b + (x-c) z \end{align*} $$
The figure shows a reservoir computer trained on the Rössler attractor successfully predicting the components $y, z$ while observing only the component $x$. In this case, $\mathbf{u} = (x)$ is an $M=1$-dimensional vector, and $\mathbf{s} = (y,z)$ is a $P=2$-dimensional vector.
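As a rough sketch of how such training data might be generated, one could integrate the Rössler system numerically, for example with `scipy.integrate.solve_ivp`; the parameter values $a = 0.2$, $b = 0.2$, $c = 5.7$, the sampling interval, and the initial condition below are illustrative choices, not taken from the text.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Rössler system; a, b, c set to the commonly used chaotic values (an assumption).
def rossler(t, state, a=0.2, b=0.2, c=5.7):
    x, y, z = state
    return [-y - z, x + a * y, b + (x - c) * z]

dt = 0.02                                  # sampling interval Δt (illustrative)
t_eval = np.arange(0.0, 200.0, dt)
sol = solve_ivp(rossler, (0.0, 200.0), [1.0, 1.0, 1.0], t_eval=t_eval, rtol=1e-9)

u = sol.y[0:1, :].T                        # input  u(kΔt) = (x),    shape (K, 1)
s = sol.y[1:3, :].T                        # output s(kΔt) = (y, z), shape (K, 2)
```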
Implementation
Reservoir dynamics
First, the number of nodes in the reservoir layer $N \in \mathbb{N}$ must be specified. $\mathbf{r} \in \mathbb{R}^{N}$ forms a dynamical system in the form of a map, and can be thought of as playing the role of a latent space. $$ \mathbf{r} \left( t + \Delta t \right) = \left( 1 - \alpha \right) \mathbf{r} (t) + \alpha \tanh \left( A \mathbf{r} (t) + W_{\text{in}} \mathbf{u} (t) + \xi \mathbf{1} \right) $$ The initial state $\mathbf{r} (0)$ can be chosen randomly, and $\alpha \in [0, 1]$ is tuned depending on how strongly the input vector $\mathbf{u}$ should influence the reservoir. $\tanh$ serves to introduce nonlinearity into the formula before applying the least-squares method; functionally it is similar to an activation function and need not be exactly the hyperbolic tangent. $\xi$ is the magnitude of the bias term, and $\mathbf{1}$ is the one-vector; since the bias term is not essential in the matrix-algebraic presentation, it was omitted in the most condensed formula above.
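A minimal NumPy sketch of this update (the function name `reservoir_step` is illustrative):

```python
import numpy as np

def reservoir_step(r, u, A, W_in, alpha, xi):
    """One leaky update r(t) -> r(t + Δt); xi enters as the scalar bias ξ
    added to every node, i.e. ξ·1 in the formula above."""
    return (1 - alpha) * r + alpha * np.tanh(A @ r + W_in @ u + xi)
```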
Reservoir layer
The interactions that occur recursively as the $N$-dimensional $\mathbf{r}$ evolves to the next time step are governed by the reservoir network. $A$ is the adjacency matrix of an Erdős–Rényi network, sampled so that its average degree equals $D$. The non-zero entries of $A$ are drawn iid from the uniform distribution $U[-1, 1]$, and $A$ is then rescaled by a scalar so that the magnitude of its largest eigenvalue, i.e., its spectral radius, equals $\rho > 0$.
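A hedged sketch of this construction (the helper name `make_adjacency` and the use of a dense matrix are illustrative choices):

```python
import numpy as np

def make_adjacency(N, D, rho, rng=None):
    """Erdős–Rényi adjacency: mean degree D, non-zero weights ~ U[-1, 1],
    rescaled so that the spectral radius equals rho."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random((N, N)) < D / N                  # edge with probability D/N
    A = np.where(mask, rng.uniform(-1.0, 1.0, (N, N)), 0.0)
    spectral_radius = np.max(np.abs(np.linalg.eigvals(A)))
    return A * (rho / spectral_radius)
```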
This way of constructing a reservoir computer with a random network is called an echo state network2. As the update rule shows, $\mathbf{r}$ embodies the idea of a recurrent neural network (RNN): from one time step to the next, the reservoir nodes influence one another through the network $A$.
Input weight matrix
$W_{\text{in}} \in \mathbb{R}^{N \times M}$ takes one of the $M$ components of the input vector $\mathbf{u} (t) \in \mathbb{R}^{M}$ and passes it to a reservoir node. Concretely, each row of $W_{\text{in}}$ has exactly one non-zero element, so each of the $N$ reservoir nodes receives exactly one input component. Which input component each node receives is chosen at random, and the corresponding weight is sampled from the uniform distribution $U[-\sigma, \sigma]$. The input weight scale $\sigma > 0$ is a hyperparameter; as this shows, the structure of $W_{\text{in}}$ can be chosen rather freely as long as there is some reasonable justification.
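A minimal sketch of such a $W_{\text{in}}$ (the helper name `make_input_weights` is illustrative):

```python
import numpy as np

def make_input_weights(N, M, sigma, rng=None):
    """One non-zero entry per row: node i reads a randomly chosen input
    component with a weight drawn from U[-sigma, sigma]."""
    rng = np.random.default_rng() if rng is None else rng
    W_in = np.zeros((N, M))
    cols = rng.integers(0, M, size=N)      # which input component each node sees
    W_in[np.arange(N), cols] = rng.uniform(-sigma, sigma, size=N)
    return W_in
```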
Output weight matrix
Up to now we have expressed $\mathbf{r}$ as a function of $\mathbf{u}$. What we ultimately want is to express $\mathbf{s}$ as a function of $\mathbf{r}$ so that, when $\mathbf{u}$ is fed in, $\mathbf{s}$ is produced. The predicted vector $\hat{ \mathbf{s} }$ of $\mathbf{s}$ is expressed as follows. $$ \hat{\mathbf{s}} (t) = W_{\text{out}} \mathbf{r} (t) + \mathbf{c} $$ Here $\mathbf{c}$ is a bias term analogous to $\xi \mathbf{1}$ mentioned earlier; it was omitted in the compact formula since it is not essential. $W_{\text{out}}$ and $\mathbf{c}$ are not hyperparameters but parameters that must be computed, i.e., learned.
First, with the hyperparameters fixed, compute all $\mathbf{r} (t)$ from $\mathbf{u} (t)$, and then define the following matrices $R$ and $S$. $$ \begin{align*} \overline{\mathbf{r}} =& {\frac{ 1 }{ K }} \sum_{k=1}^{K} \mathbf{r} (k \Delta t) \\ \overline{\mathbf{s}} =& {\frac{ 1 }{ K }} \sum_{k=1}^{K} \mathbf{s} (k \Delta t) \\ \left( R \right)_{:, k} =& \mathbf{r} (k \Delta t) - \overline{\mathbf{r}} \\ \left( S \right)_{:, k} =& \mathbf{s} (k \Delta t) - \overline{\mathbf{s}} \end{align*} $$ Here $\left( X \right)_{:, j}$ denotes the $j$-th column of the matrix $X$. In simple terms, $R$ and $S$ are centered design matrices. This completes the setup for applying the least-squares method.
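Putting the sketches above together, the reservoir states can be collected and centered as follows. The hyperparameter values are illustrative, and warm-up handling and the exact one-step time alignment between $\mathbf{r}$ and $\mathbf{s}$ are glossed over; the Rössler arrays `u`, `s` and the helpers `make_adjacency`, `make_input_weights`, `reservoir_step` come from the earlier sketches.

```python
import numpy as np

# Illustrative hyperparameter values (not taken from the text).
N, D, rho, sigma, alpha, xi = 300, 6, 0.9, 0.5, 0.3, 1.0
rng = np.random.default_rng(42)
A = make_adjacency(N, D, rho, rng)
W_in = make_input_weights(N, u.shape[1], sigma, rng)

K = u.shape[0]
r = np.zeros(N)                              # r(0); a random initial state also works
states = np.empty((N, K))
for k in range(K):
    r = reservoir_step(r, u[k], A, W_in, alpha, xi)
    states[:, k] = r

r_bar = states.mean(axis=1, keepdims=True)   # \bar{r}, shape (N, 1)
s_bar = s.mean(axis=0)                       # \bar{s}, shape (P,)
R = states - r_bar                           # column k is r(kΔt) - \bar{r}
S = (s - s_bar).T                            # column k is s(kΔt) - \bar{s}
```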
Closed-form solution of ridge regression:
$$ L \left( \beta \right) = \left\| Y - X \beta \right\|_{2}^{2} + \lambda \left\| \beta \right\|_{2}^{2} $$
For a given constant $\lambda$, the ridge regression objective function $L$ is written as above. The optimal solution $\hat{\beta} = \operatorname{argmin}_{\beta} L \left( \beta \right)$ of ridge regression is: $$ \hat{\beta} = \left( X^{T} X + \lambda I \right)^{-1} X^{T} Y $$ Here $A^{T}$ denotes the transpose of $A$, $I$ is the identity matrix, and $A^{-1}$ is the inverse of $A$.
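As a quick numerical sanity check of this closed form (the shapes and values below are arbitrary), the gradient of $L$ at $\hat{\beta}$, namely $2(X^{T}X + \lambda I)\hat{\beta} - 2X^{T}Y$, should vanish:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))     # 50 samples, 4 features
Y = rng.normal(size=(50, 2))     # 2 targets
lam = 0.1

beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ Y)   # closed form above
grad = 2 * (X.T @ X + lam * np.eye(4)) @ beta_hat - 2 * X.T @ Y
assert np.allclose(grad, 0)
```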
If ridge regression is used as regularization to prevent overfitting, then for a regularization parameter $\lambda \geq 0$, $W_{\text{out}}^{\ast}$ is computed as follows. $$ W_{\text{out}}^{\ast} = S R^{T} \left( R R^{T} + \lambda I \right)^{-1} $$ The order of multiplication appears reversed relative to the cited formula because $R$ and $S$ were defined with one column per training sample, i.e. they grow horizontally with $K$, so the roles of the matrix and its transpose are swapped. $\mathbf{c}^{\ast}$ is computed as follows and can be interpreted as a mean-offset (drift) term. $$ \mathbf{c}^{\ast} = \overline{\mathbf{s}} - W_{\text{out}}^{\ast} \overline{\mathbf{r}} $$
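Continuing the sketch above, $W_{\text{out}}^{\ast}$ and $\mathbf{c}^{\ast}$ can be computed as follows (the value of $\lambda$ is illustrative):

```python
import numpy as np

lam = 1e-6                                   # ridge parameter λ (illustrative)
# Solving the symmetric system is equivalent to S Rᵀ (R Rᵀ + λI)⁻¹ but avoids
# forming an explicit inverse.
W_out = np.linalg.solve(R @ R.T + lam * np.eye(R.shape[0]), R @ S.T).T
c = s_bar - W_out @ r_bar.ravel()            # c* = \bar{s} - W_out* \bar{r}
```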
As a result, one might think this is simply multivariate nonlinear regression — and indeed it is. The implementation-faithful formula without omissions is: $$ \begin{align*} \hat{ \mathbf{s} } (t + \Delta t) =& W_{\text{out}}^{\ast} \mathbf{r} (t + \Delta t) + \mathbf{c}^{\ast} \\ =& W_{\text{out}}^{\ast} \left[ \left( 1 - \alpha \right) \mathbf{r} (t) + \alpha \tanh \left( A \mathbf{r} (t) + W_{\text{in}} \mathbf{u} (t) + \xi \mathbf{1} \right) \right] + \mathbf{c}^{\ast} \\ =& f \left( \mathbf{u} (t) \right) \end{align*} $$
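For completeness, a hedged inference loop reusing the illustrative objects defined in the sketches above; `u_new` stands for unseen inputs and is hypothetical:

```python
import numpy as np

# Feed new inputs u_new (shape (K_new, M)), evolve the reservoir, read out ŝ.
r = np.zeros(A.shape[0])
predictions = []
for u_t in u_new:
    r = reservoir_step(r, u_t, A, W_in, alpha, xi)   # evolve the reservoir
    predictions.append(W_out @ r + c)                # ŝ = W_out* r + c*
predictions = np.asarray(predictions)                # shape (K_new, P)
```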
Inspecting the trained model, one sees that only $W_{\text{out}}^{\ast}$ and $\mathbf{c}^{\ast}$ are actually learned, and in closed form at that, so the method is overwhelmingly lightweight compared to deep learning methods that rely on gradient descent.
Standardization issues
The authors observed that the method performs well when the data are standardized, but may fail when they are not.

This is the performance of a reservoir computer trained on the Lorenz attractor. The top panel shows the unstandardized case, and the bottom panel shows the standardized case.
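A minimal way to standardize (z-score) the data before training, reusing the illustrative Rössler arrays `u`, `s` from the sketches above:

```python
# Z-score each component using training-set statistics only.
u_mean, u_std = u.mean(axis=0), u.std(axis=0)
s_mean, s_std = s.mean(axis=0), s.std(axis=0)
u_scaled = (u - u_mean) / u_std
s_scaled = (s - s_mean) / s_std
# Train on (u_scaled, s_scaled); predictions on the scaled data are mapped
# back via ŝ = ŝ_scaled * s_std + s_mean.
```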
Zhixin Lu, Jaideep Pathak, Brian Hunt, Michelle Girvan, Roger Brockett, Edward Ott; Reservoir observers: Model-free inference of unmeasured variables in chaotic systems. Chaos 1 April 2017; 27 (4): 041102. https://doi.org/10.1063/1.4979665 ↩︎
Herbert Jaeger, Harald Haas; Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication. Science 304, 78–80 (2004). https://doi.org/10.1126/science.1091277 ↩︎
