Convolutional Layer
Definition
Let $\mathbf{W}$ be a $k \times k$ matrix. Define $M^{n\times n} = M^{n\times n}(\mathbb{R})$ to be the set of real matrices of size $n \times n$. A convolutional layer $C_{\mathbf{W}} : M^{n \times n} \to M^{(n-k+1) \times (n-k+1)}$ is a function defined as follows. For $\mathbf{X} \in M^{n\times n}$ and $\mathbf{Y} = C_{\mathbf{W}}(\mathbf{X})$,
$$ \begin{align*} Y_{ij} &= \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1k} \\ w_{21} & w_{22} & \cdots & w_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ w_{k1} & w_{k2} & \cdots & w_{kk} \end{bmatrix} \odot \begin{bmatrix} X_{i, j} & X_{i, j+1} & \cdots & X_{i, j+k-1} \\ X_{i+1, j} & X_{i+1, j+1} & \cdots & X_{i+1, j+k-1} \\ \vdots & \vdots & \ddots & \vdots \\ X_{i+k-1, j} & X_{i+k-1, j+1} & \cdots & X_{i+k-1, j+k-1} \end{bmatrix} \\ &= \sum_{q=1}^{k} \sum_{r=1}^{k} W_{qr} X_{i+q-1, j+r-1} \end{align*} $$
$X_{ij}$ is the element of $\mathbf{X}$ at row $i$ and column $j$, and $\odot$ here denotes summing the entries of the Hadamard (entrywise) product.
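The definition above can be sketched directly in NumPy. This is a minimal illustration, not an efficient implementation; the function name `conv_layer` is chosen here for convenience.

```python
import numpy as np

def conv_layer(W, X):
    """Apply the convolutional layer C_W to X with stride 1,
    following the definition above (0-based indexing)."""
    k = W.shape[0]
    n = X.shape[0]
    m = n - k + 1                      # output is (n-k+1) x (n-k+1)
    Y = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            # sum of the entrywise product of W with the k x k patch of X
            Y[i, j] = np.sum(W * X[i:i+k, j:j+k])
    return Y
```

For example, a $2 \times 2$ identity kernel applied to a $3 \times 3$ input sums each patch's main diagonal.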
Explanation
The convolutional layer is a mapping that sends a given input $\mathbf{X}$ to the 2D discrete convolution $\mathbf{W} \ast \mathbf{X}$. There are various terms for $\mathbf{W}$, such as kernel, filter, and window. In the above definition, it is defined as a square matrix, but it is acceptable to generalize $\mathbf{W}$ to a $k_{1} \times k_{2}$ matrix and $M^{n \times n}$ to $M^{n_{1} \times n_{2}}$. Computing the output amounts to sliding the kernel across the input and, at each position, summing the products of the overlapping entries.
A function that combines a convolutional layer with an activation function is called a Convolutional Neural Network (CNN). CNNs usually show excellent performance in tasks related to images. In an MLP, the values pass through a fully connected layer at each step, leading to an enormous number of parameters when the data dimension is large and the network is deep. In contrast, in a convolutional layer, the number of parameters depends only on the size of the kernel, independently of the input data size, which allows for a significant reduction in the number of parameters compared to linear layers.
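To make the parameter comparison concrete, consider a hypothetical $100 \times 100$ input mapped to an output of the same size (bias terms ignored):

```python
n = 100                          # input is n x n
k = 3                            # kernel is k x k
fc_params = (n * n) * (n * n)    # fully connected layer: one weight per input-output pair
conv_params = k * k              # one convolution kernel, shared across all positions
print(fc_params)                 # 100000000
print(conv_params)               # 9
```

The fully connected layer needs $10^{8}$ weights, while the convolutional layer needs only $9$, regardless of the input size.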
Historically, it was proposed by mimicking how the optic nerve operates in the brain.
Stride
For a convolutional layer $C_{\mathbf{W}}$ given with a kernel $\mathbf{W} \in \mathbb{R}^{k \times k}$, if $\mathbf{Y} = C_{\mathbf{W}}(\mathbf{X})$ is defined as follows, then the ordered pair $(s_{1}, s_{2})$ is called the stride. For $\mathbf{X} \in M^{n \times n}$,
$$ Y_{ij} = \sum_{q=1}^{k} \sum_{r=1}^{k} W_{qr} X_{s_{1}(i-1)+q, s_{2}(j-1)+r} $$
The ranges of $i$ and $j$ are,
$$ i= 1, 2, \dots, \left\lfloor \frac{n-k}{s_{1}} \right\rfloor + 1, \quad j=1, 2, \dots, \left\lfloor \frac{n-k}{s_{2}} \right\rfloor + 1 $$
$\left\lfloor \cdot \right\rfloor$ is the floor function.
Intuitively, convolution involves moving the kernel one step at a time and taking a dot product over the area where it overlaps the data $\mathbf{X}$. However, there is no rule that the kernel must move one step at a time. The stride is simply the number of steps the kernel moves at once. Unless otherwise specified, $\text{stride} = (1, 1)$ is the default, both in the definition and typically in code.
If the kernel size is $k_{1} \times k_{2}$, the size of the input matrix is $n_{1} \times n_{2}$, and the stride is $(s_{1}, s_{2})$, then the size of the output matrix of the convolutional layer is as follows.
$$ \left( \left\lfloor \frac{n_{1} - k_{1}}{s_{1}} \right\rfloor + 1 \right) \times \left( \left\lfloor \frac{n_{2} - k_{2}}{s_{2}} \right\rfloor + 1 \right) $$
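The strided layer can be sketched by generalizing the stride-1 loop: the patch for output position $(i, j)$ starts at row $s_1 i$ and column $s_2 j$ (0-based). The function name `conv_layer_stride` is illustrative.

```python
import numpy as np

def conv_layer_stride(W, X, s1=1, s2=1):
    """2D convolutional layer with stride (s1, s2), following the formula above."""
    k = W.shape[0]
    n = X.shape[0]
    out1 = (n - k) // s1 + 1           # floor((n - k)/s1) + 1
    out2 = (n - k) // s2 + 1           # floor((n - k)/s2) + 1
    Y = np.empty((out1, out2))
    for i in range(out1):
        for j in range(out2):
            # the kernel is placed at offset (s1*i, s2*j)
            Y[i, j] = np.sum(W * X[s1*i:s1*i+k, s2*j:s2*j+k])
    return Y
```

For instance, a $3 \times 3$ kernel on a $5 \times 5$ input with stride $(2, 2)$ yields a $\lfloor (5-3)/2 \rfloor + 1 = 2$ by $2$ output.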
Padding
For the ordered pair $(p_{1}, p_{2})$, the following function, or the ordered pair itself, is called padding.
$$ \begin{align*} \operatorname{padding} : M^{n \times n}(\mathbb{R}) &\to M^{(n+2p_{1}) \times (n+2p_{2})}(\mathbb{R}) \\ \mathbf{X} &\mapsto \begin{bmatrix} O_{p_{1} \times p_{2}} & O_{p_{1} \times n} & O_{p_{1} \times p_{2}} \\ O_{n \times p_{2}} & \mathbf{X} & O_{n \times p_{2}} \\ O_{p_{1} \times p_{2}} & O_{p_{1} \times n} & O_{p_{1} \times p_{2}} \end{bmatrix} \end{align*} $$
The above form is a block matrix, and $O$ is a zero matrix. Simply put, it means adding values around all four sides of the matrix. Padding is added because the codomain $M^{(n-k+1) \times (n-k+1)}$ of a convolutional layer has smaller dimension than the domain $M^{n \times n}$. This means that if an image is repeatedly fed through convolutional layers, it gradually shrinks. Padding can prevent this. Adding padding to a convolutional layer means applying padding between the input $\mathbf{X}$ and the convolution $C_{\mathbf{W}}$.
$$ C_{\mathbf{W}} \circ \operatorname{padding} (\mathbf{X}) $$
Thus, by enlarging $\mathbf{X}$ in advance, the output can retain the original size despite shrinking as it passes through $C_{\mathbf{W}}$. If $k$ is odd, padding with $p = (k-1)/2$ (at stride $1$) preserves the size of the input matrix.
Above, padding was defined as adding $0$ to the matrix’s top, bottom, left, and right, but the padding value does not have to be $0$. In PyTorch, various padding methods are implemented. Particularly, padding with $0$ is called zero padding.
If the kernel size is $k_{1} \times k_{2}$, the input matrix size is $n_{1} \times n_{2}$, the stride is $(s_{1}, s_{2})$, and the padding is $(p_{1}, p_{2})$, then the size of the output matrix of convolutional layer $C_{\mathbf{W}} \circ \operatorname{padding}$ is as follows.
$$ \left( \left\lfloor \frac{(n_{1} + 2p_{1}) - k_{1}}{s_{1}} \right\rfloor + 1 \right) \times \left( \left\lfloor \frac{(n_{2} + 2p_{2}) - k_{2}}{s_{2}} \right\rfloor + 1 \right) $$
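A small sketch of $C_{\mathbf{W}} \circ \operatorname{padding}$, assuming zero padding and stride $1$; `pad_and_conv` is a hypothetical name, and `np.pad` with a scalar pads all four sides with zeros by default.

```python
import numpy as np

def pad_and_conv(W, X, p):
    """Zero-pad X by p on every side, then apply the stride-1 convolutional layer."""
    Xp = np.pad(X, p)                  # shape becomes (n + 2p) x (n + 2p)
    k = W.shape[0]
    m = Xp.shape[0] - k + 1            # (n + 2p) - k + 1
    Y = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            Y[i, j] = np.sum(W * Xp[i:i+k, j:j+k])
    return Y
```

With $k = 3$ and $p = (k-1)/2 = 1$, a $5 \times 5$ input produces a $5 \times 5$ output, confirming the size-preservation claim above.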
Channel
The kernel may also be a tensor instead of a matrix. When $\mathbf{W}$ is a tensor of size $k \times k \times c$, $c$ is referred to as the number of channels of $\mathbf{W}$. If the input matrix has size $n \times n$ and the kernel has size $k \times k \times c$, then $\mathbf{Y} = C_{\mathbf{W}}(\mathbf{X})$ has size $(n-k+1) \times (n-k+1) \times c$. The output is calculated as follows.
$$ Y_{ij\ell} = \sum_{q=1}^{k} \sum_{r=1}^{k} W_{qr\ell} X_{s_{1}(i-1)+q,\, s_{2}(j-1)+r} $$
Here, when the stride is $(s_{1}, s_{2})$, the ranges of $i$, $j$, and $\ell$ are,
$$ i= 1, 2, \dots, \left\lfloor \frac{n-k}{s_{1}} \right\rfloor + 1, \quad j=1, 2, \dots, \left\lfloor \frac{n-k}{s_{2}} \right\rfloor + 1 \\[1em] \ell = 1, \dots, c $$
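The multi-channel case can be sketched by applying one $k \times k$ slice of the kernel per output channel; this assumes stride $(1, 1)$, and `conv_layer_channels` is an illustrative name.

```python
import numpy as np

def conv_layer_channels(W, X):
    """Convolutional layer with a k x k x c kernel:
    channel l of the output uses kernel slice W[:, :, l]."""
    k, _, c = W.shape
    n = X.shape[0]
    m = n - k + 1                      # each output map is (n-k+1) x (n-k+1)
    Y = np.empty((m, m, c))
    for l in range(c):
        for i in range(m):
            for j in range(m):
                Y[i, j, l] = np.sum(W[:, :, l] * X[i:i+k, j:j+k])
    return Y
```

A $2 \times 2 \times 3$ kernel on a $4 \times 4$ input yields a $3 \times 3 \times 3$ output, as the size formula predicts.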