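Let $W$ be a $k \times k$ matrix, and let $M_{n \times n} = M_{n \times n}(\mathbb{R})$ denote the set of real matrices of size $n \times n$. A convolutional layer $C_W : M_{n \times n} \to M_{(n-k+1) \times (n-k+1)}$ is the function defined as follows. For $X \in M_{n \times n}$ and $Y = C_W(X)$,
$$
Y_{ij} = \sum_{q=1}^{k} \sum_{r=1}^{k} W_{qr} X_{i+q-1,\, j+r-1}, \qquad i, j = 1, \dots, n-k+1.
$$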
For a given $W$, the convolutional layer is the mapping that sends $X$ to the 2D discrete convolution $W \ast X$. $W$ goes by several names, such as kernel, filter, and window. The definition above takes $W$ to be a square matrix, but it generalizes directly to a $k_1 \times k_2$ kernel $W$ and inputs in $M_{n_1 \times n_2}$. The process of calculating the output of the convolutional layer is shown in the GIF below.¹
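As a minimal sketch of this definition, the following pure-Python function computes $Y = C_W(X)$ with the index convention $X_{i+q-1,\, j+r-1}$ used in this article (the kernel is not flipped). The function name `conv2d` and the sample values of `X` and `W` are illustrative assumptions, not something prescribed by the text.

```python
def conv2d(W, X):
    """Convolutional layer C_W(X) with stride (1, 1) and no padding.

    W is a k1 x k2 kernel and X an n1 x n2 input, both given as nested lists.
    Indices are 0-based here, so X[i + q][j + r] plays the role of
    X_{i+q-1, j+r-1} in the 1-based formula.
    """
    k1, k2 = len(W), len(W[0])
    rows = len(X) - k1 + 1  # n1 - k1 + 1
    cols = len(X[0]) - k2 + 1  # n2 - k2 + 1
    return [
        [
            sum(W[q][r] * X[i + q][j + r] for q in range(k1) for r in range(k2))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


# Illustrative 4 x 4 input and 2 x 2 kernel (assumed values).
X = [[1, 2, 3, 0],
     [0, 1, 2, 3],
     [3, 0, 1, 2],
     [2, 3, 0, 1]]
W = [[1, 0],
     [0, 1]]

Y = conv2d(W, X)
print(Y)  # [[2, 4, 6], [0, 2, 4], [6, 0, 2]]
```

With $n = 4$ and $k = 2$ the output is $3 \times 3$, in line with $n - k + 1$.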
A composition of convolutional layers and activation functions is called a convolutional neural network (CNN). CNNs usually perform very well on image-related tasks. In an MLP, values pass through a fully connected layer at every layer, so the number of parameters becomes enormous when the data dimension is large and the network is deep. In a convolutional layer, by contrast, the number of parameters depends only on the kernel size and not on the input size, which allows a significant reduction in the number of parameters compared to linear layers.
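To make the parameter comparison concrete, here is a small sketch that counts weights for one layer of each kind. The input size $28 \times 28$ and kernel size $3 \times 3$ are assumed purely for illustration, biases are ignored, and the helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    """Number of weights in a fully connected layer from n_in to n_out values."""
    return n_in * n_out


def conv_params(k1, k2):
    """Number of weights in a single k1 x k2 kernel."""
    return k1 * k2


n, k = 28, 3  # assumed input side length and kernel side length
m = n - k + 1  # output side length of the convolutional layer: 26

# A linear layer mapping the flattened n x n input to a flattened m x m output.
dense = dense_params(n * n, m * m)
# A convolutional layer producing an output of the same m x m size.
conv = conv_params(k, k)

print(dense)  # 529984
print(conv)   # 9
```

Doubling `n` roughly multiplies the dense count by sixteen, while the kernel count stays at 9.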
Historically, the CNN was proposed by mimicking how the optic nerve operates in the brain.
Stride
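For a convolutional layer $C_W$ with kernel $W \in \mathbb{R}^{k \times k}$, if $Y = C_W(X)$ is defined as follows, then the ordered pair $(s_1, s_2)$ is called the stride. For $X \in M_{n \times n}$,
$$
Y_{ij} = \sum_{q=1}^{k} \sum_{r=1}^{k} W_{qr} X_{s_1(i-1)+q,\, s_2(j-1)+r}, \qquad i = 1, \dots, \left\lfloor \frac{n-k}{s_1} \right\rfloor + 1, \quad j = 1, \dots, \left\lfloor \frac{n-k}{s_2} \right\rfloor + 1.
$$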
Intuitively, convolution moves the kernel one step at a time and takes the dot product with the overlapping region of the data $X$. However, nothing requires the kernel to move one step at a time. The stride is the number of steps the kernel moves at once. Unless otherwise specified, the stride is usually $(1, 1)$, and this is typically the default in code as well.
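The effect of the stride can be sketched in pure Python as follows. The function name `conv2d_strided` and the sample input are assumptions made for illustration. The kernel's top-left corner visits rows $0, s_1, 2s_1, \dots$ and columns $0, s_2, 2s_2, \dots$ (0-based), which gives $\lfloor (n-k)/s \rfloor + 1$ positions along each axis and reduces to $n-k+1$ when the stride is 1.

```python
def conv2d_strided(W, X, stride=(1, 1)):
    """Convolutional layer with stride (s1, s2) and no padding.

    W is a k x k kernel and X an n x n input, both given as nested lists.
    """
    k, n = len(W), len(X)
    s1, s2 = stride
    rows = (n - k) // s1 + 1  # number of vertical kernel positions
    cols = (n - k) // s2 + 1  # number of horizontal kernel positions
    return [
        [
            sum(
                W[q][r] * X[s1 * i + q][s2 * j + r]
                for q in range(k)
                for r in range(k)
            )
            for j in range(cols)
        ]
        for i in range(rows)
    ]


# Illustrative 5 x 5 input with entries 0..24 and a 2 x 2 kernel of ones,
# so each output entry is the sum of one 2 x 2 window.
X = [[5 * row + col for col in range(5)] for row in range(5)]
W = [[1, 1],
     [1, 1]]

Y1 = conv2d_strided(W, X, stride=(1, 1))  # 4 x 4 output
Y2 = conv2d_strided(W, X, stride=(2, 2))  # 2 x 2 output
print(Y2)  # [[12, 20], [52, 60]]
```

With stride $(2, 2)$ the kernel skips every other position, so `Y2` consists of the entries of `Y1` at even row and column indices.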
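If the kernel size is $k_1 \times k_2$, the input matrix is $n_1 \times n_2$, and the stride is $(s_1, s_2)$, then the output matrix of the convolutional layer has the following size.
$$
\left( \left\lfloor \frac{n_1 - k_1}{s_1} \right\rfloor + 1 \right) \times \left( \left\lfloor \frac{n_2 - k_2}{s_2} \right\rfloor + 1 \right)
$$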
Padding
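For an ordered pair $(p_1, p_2)$, the following function, or the ordered pair itself, is called padding.
$$
\operatorname{padding} : M_{n \times n} \to M_{(n+2p_1) \times (n+2p_2)}, \qquad
X \mapsto \begin{bmatrix}
O_{p_1 \times p_2} & O_{p_1 \times n} & O_{p_1 \times p_2} \\
O_{n \times p_2} & X & O_{n \times p_2} \\
O_{p_1 \times p_2} & O_{p_1 \times n} & O_{p_1 \times p_2}
\end{bmatrix}
$$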
The above form is a block matrix, and $O$ is a zero matrix. Simply put, padding fills in values around the top, bottom, left, and right of the matrix. Padding is used because the codomain $M_{(n-k+1) \times (n-k+1)}$ of a convolutional layer has smaller dimension than the domain $M_{n \times n}$. This means that an image passed repeatedly through convolutional layers gradually shrinks, and padding can prevent this. Adding padding to a convolutional layer means applying the padding to the input $X$ before the convolution $C_W$.
$$
C_W \circ \operatorname{padding}(X)
$$
Thus, if $X$ is enlarged in advance, the original size can be maintained even after it shrinks by passing through $C_W$. If $k$ is odd and the stride is $(1, 1)$, padding with $p = (k-1)/2$ on every side retains the size of the input matrix.
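The size-preserving choice $p = (k-1)/2$ can be checked with a short sketch. It assumes that $p_1$ counts the zero rows added above and below and $p_2$ the zero columns added left and right. The helper names `zero_pad` and `conv2d`, and the sample matrices, are hypothetical.

```python
def zero_pad(X, p1, p2):
    """Surround X with p1 zero rows (top and bottom) and p2 zero columns (left and right)."""
    width = len(X[0]) + 2 * p2
    top = [[0] * width for _ in range(p1)]
    bottom = [[0] * width for _ in range(p1)]
    middle = [[0] * p2 + list(row) + [0] * p2 for row in X]
    return top + middle + bottom


def conv2d(W, X):
    """Convolutional layer with stride (1, 1) and no padding, on nested lists."""
    k1, k2 = len(W), len(W[0])
    rows, cols = len(X) - k1 + 1, len(X[0]) - k2 + 1
    return [
        [
            sum(W[q][r] * X[i + q][j + r] for q in range(k1) for r in range(k2))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


X = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
# 3 x 3 kernel that copies the centre of each window (assumed for illustration).
W = [[0, 0, 0],
     [0, 1, 0],
     [0, 0, 0]]

p = (len(W) - 1) // 2  # k = 3 gives p = 1
Y = conv2d(W, zero_pad(X, p, p))  # C_W applied to padding(X)
print(len(Y), len(Y[0]))  # 4 4
```

Without padding the same kernel would give a $2 \times 2$ output. With $p = 1$ the output is again $4 \times 4$, and for this particular kernel it equals `X`.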
Above, padding was defined as adding 0 to the matrix's top, bottom, left, and right, but the padding value does not have to be 0. PyTorch implements various padding methods. Padding with 0 in particular is called zero padding.
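If the kernel size is $k_1 \times k_2$, the input matrix size is $n_1 \times n_2$, the stride is $(s_1, s_2)$, and the padding is $(p_1, p_2)$, then the output matrix of the convolutional layer $C_W \circ \operatorname{padding}$ has the following size.
$$
\left( \left\lfloor \frac{n_1 + 2p_1 - k_1}{s_1} \right\rfloor + 1 \right) \times \left( \left\lfloor \frac{n_2 + 2p_2 - k_2}{s_2} \right\rfloor + 1 \right)
$$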
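The output size can be computed with a small helper. This is a sketch that assumes padding adds $p_i$ entries on both sides of axis $i$, so each axis has length $\lfloor (n_i + 2p_i - k_i)/s_i \rfloor + 1$. The function name `conv_output_size` is hypothetical.

```python
def conv_output_size(n, k, s=(1, 1), p=(0, 0)):
    """Output shape of a convolutional layer.

    n: input size (n1, n2), k: kernel size (k1, k2),
    s: stride (s1, s2), p: padding (p1, p2) added on both sides of each axis.
    """
    return tuple(
        (n_i + 2 * p_i - k_i) // s_i + 1
        for n_i, k_i, s_i, p_i in zip(n, k, s, p)
    )


print(conv_output_size((5, 5), (3, 3)))                              # (3, 3)
print(conv_output_size((7, 7), (3, 3), s=(2, 2)))                    # (3, 3)
print(conv_output_size((28, 28), (5, 5), p=(2, 2)))                  # (28, 28)
print(conv_output_size((32, 32), (3, 3), s=(2, 2), p=(1, 1)))        # (16, 16)
```

The third call illustrates the size-preserving case $p = (k-1)/2$ with stride $(1, 1)$.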
The kernel may also be a tensor instead of a matrix. When $W$ is a tensor of size $k \times k \times c$, $c$ is referred to as the channel of $W$. If the input matrix is $n \times n$ and the kernel is $k \times k \times c$, then $Y = C_W(X)$ has size $(n-k+1) \times (n-k+1) \times c$. The output is calculated as follows.
$$
Y_{ij\ell} = \sum_{q=1}^{k} \sum_{r=1}^{k} W_{qr\ell} X_{i+q-1,\, j+r-1}
$$
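Here, when the stride is $(s_1, s_2)$, the input index $X_{i+q-1,\, j+r-1}$ is replaced by $X_{s_1(i-1)+q,\, s_2(j-1)+r}$, and the indices $i$, $j$, $\ell$ range over
$$
i = 1, \dots, \left\lfloor \frac{n-k}{s_1} \right\rfloor + 1, \qquad j = 1, \dots, \left\lfloor \frac{n-k}{s_2} \right\rfloor + 1, \qquad \ell = 1, \dots, c.
$$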
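The channel formula for $Y_{ij\ell}$ can be sketched for stride $(1, 1)$ as follows. The kernel is stored as a nested list `W[q][r][ch]` of size $k \times k \times c$. The function name `conv2d_channels` and the sample kernel are assumptions for illustration.

```python
def conv2d_channels(W, X):
    """Convolutional layer with a k x k x c kernel, stride (1, 1), no padding.

    W[q][r][ch] is the kernel and X[i][j] the n x n input.
    Returns Y with Y[i][j][ch] of size (n-k+1) x (n-k+1) x c.
    """
    k, c, n = len(W), len(W[0][0]), len(X)
    m = n - k + 1
    return [
        [
            [
                sum(W[q][r][ch] * X[i + q][j + r] for q in range(k) for r in range(k))
                for ch in range(c)
            ]
            for j in range(m)
        ]
        for i in range(m)
    ]


X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
# 2 x 2 x 2 kernel: channel 0 picks the top-left entry of each window,
# channel 1 sums the whole window.
W = [[[1, 1], [0, 1]],
     [[0, 1], [0, 1]]]

Y = conv2d_channels(W, X)
print(Y)  # [[[1, 12], [2, 16]], [[4, 24], [5, 28]]]
```

Each channel of the kernel is applied to the same input independently, so the output here is $2 \times 2 \times 2$.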