
Convolutional Layer

Definition

Let $\mathbf{W}$ be a $k \times k$ matrix. Define $M^{n\times n} = M^{n\times n}(\mathbb{R})$ to be the set of real matrices of size $n \times n$. A convolutional layer $C_{\mathbf{W}} : M^{n\times n} \to M^{(n-k+1) \times (n-k+1)}$ is a function defined as follows. For $\mathbf{X} \in M^{n\times n}$ and $\mathbf{Y} = C_{\mathbf{W}}(\mathbf{X})$,

$$
\begin{align*}
Y_{ij} &= \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1k} \\ w_{21} & w_{22} & \cdots & w_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ w_{k1} & w_{k2} & \cdots & w_{kk} \end{bmatrix} \cdot \begin{bmatrix} X_{i, j} & X_{i, j+1} & \cdots & X_{i, j+k-1} \\ X_{i+1, j} & X_{i+1, j+1} & \cdots & X_{i+1, j+k-1} \\ \vdots & \vdots & \ddots & \vdots \\ X_{i+k-1, j} & X_{i+k-1, j+1} & \cdots & X_{i+k-1, j+k-1} \end{bmatrix} \\
&= \sum_{q=1}^{k} \sum_{r=1}^{k} W_{qr} X_{i+q-1, j+r-1}
\end{align*}
$$

Here $\cdot$ denotes the entrywise sum of products (the Frobenius inner product), not the matrix product, and $X_{ij}$ is the element of $\mathbf{X}$ at row $i$ and column $j$.
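As a quick illustration, the definition can be implemented directly in NumPy. This is a minimal sketch, not a reference implementation: the function name `conv2d` and the nested-loop structure are illustrative choices, and 0-based indexing replaces the 1-based indexing of the definition.

```python
import numpy as np

def conv2d(W: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Apply the convolutional layer C_W to X, following the definition above."""
    k = W.shape[0]                          # kernel is k x k
    n = X.shape[0]                          # input is n x n
    Y = np.zeros((n - k + 1, n - k + 1))
    for i in range(n - k + 1):
        for j in range(n - k + 1):
            # Y_ij = sum over q, r of W_qr * X_{i+q-1, j+r-1} (0-indexed here)
            Y[i, j] = np.sum(W * X[i:i + k, j:j + k])
    return Y

W = np.arange(9.0).reshape(3, 3)            # k = 3
X = np.arange(25.0).reshape(5, 5)           # n = 5
print(conv2d(W, X).shape)                   # (3, 3) = (n-k+1, n-k+1)
```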

Explanation

The convolutional layer is the mapping that, for a given $\mathbf{W}$, sends $\mathbf{X}$ to the 2D discrete convolution $\mathbf{W} \ast \mathbf{X}$ (strictly speaking a cross-correlation, since the kernel is not flipped). There are various terms for $\mathbf{W}$, such as kernel, filter, and window. In the definition above it is a square matrix, but it is fine to generalize $\mathbf{W}$ to a $k_{1} \times k_{2}$ matrix and $M^{n \times n}$ to $M^{n_{1} \times n_{2}}$. The process of computing the output of a convolutional layer is illustrated in the GIF below.

A neural network built by composing convolutional layers with activation functions is called a convolutional neural network (CNN). CNNs usually show excellent performance in tasks related to images. In an MLP, values are passed through a fully connected layer at every layer, which leads to an enormous number of parameters when the data dimension is large and the network is deep. In contrast, in a convolutional layer the number of parameters depends only on the size of the kernel, independently of the input size, which allows a significant reduction in the number of parameters compared to linear layers. A rough comparison is sketched below.
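To make the comparison concrete, here is a back-of-the-envelope count under an assumed $224 \times 224$ single-channel input, a same-sized output, and a $3 \times 3$ kernel; the sizes are illustrative and biases are omitted.

```python
# Back-of-the-envelope parameter counts (biases omitted); sizes are assumed.
n = 224        # input is n x n
k = 3          # kernel is k x k

fc_params = (n * n) * (n * n)   # fully connected: every input to every output
conv_params = k * k             # convolutional layer: just the kernel entries

print(f"fully connected: {fc_params:,}")   # 2,517,630,976
print(f"convolutional:   {conv_params}")   # 9
```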

Historically, it was proposed by mimicking how the visual cortex in the brain processes visual stimuli.

Stride

For a convolutional layer $C_{\mathbf{W}}$ with a kernel $\mathbf{W} \in \mathbb{R}^{k \times k}$, if $\mathbf{Y} = C_{\mathbf{W}}(\mathbf{X})$ is defined as follows, then the ordered pair $(s_{1}, s_{2})$ is called the stride. For $\mathbf{X} \in M^{n \times n}$,

$$
Y_{ij} = \sum_{q=1}^{k} \sum_{r=1}^{k} W_{qr} X_{s_{1}(i-1)+q,\, s_{2}(j-1)+r}
$$

The ranges of $i$ and $j$ are

$$
i = 1, 2, \dots, \left\lfloor \frac{n-k}{s_{1}} \right\rfloor + 1, \quad j = 1, 2, \dots, \left\lfloor \frac{n-k}{s_{2}} \right\rfloor + 1
$$

$\left\lfloor \cdot \right\rfloor$ is the floor function.

Intuitively, convolution moves the kernel one step at a time and takes a dot product with the overlapping part of the data $\mathbf{X}$. However, there is no rule that the kernel must move one step at a time; the stride is the number of steps the kernel moves at once. Unless otherwise specified, $\text{stride} = (1, 1)$ is the default, and it usually is in code as well.

If the kernel size is $k_{1} \times k_{2}$, the size of the input matrix is $n_{1} \times n_{2}$, and the stride is $(s_{1}, s_{2})$, then the size of the output matrix of the convolutional layer is as follows.

$$
\left( \left\lfloor \frac{n_{1} - k_{1}}{s_{1}} \right\rfloor + 1 \right) \times \left( \left\lfloor \frac{n_{2} - k_{2}}{s_{2}} \right\rfloor + 1 \right)
$$
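This formula can be checked against an actual convolution. Below is a minimal sketch with PyTorch, assuming arbitrary sizes $n = 10$, $k = 3$, and stride $(2, 3)$; the helper name `conv_output_size` is a hypothetical choice.

```python
import torch

def conv_output_size(n: int, k: int, s: int) -> int:
    """Output size along one dimension: floor((n - k) / s) + 1."""
    return (n - k) // s + 1

n, k, (s1, s2) = 10, 3, (2, 3)
conv = torch.nn.Conv2d(in_channels=1, out_channels=1, kernel_size=k, stride=(s1, s2))
y = conv(torch.randn(1, 1, n, n))           # (batch, channel, height, width)
print(y.shape[2:])                          # torch.Size([4, 3])
print(conv_output_size(n, k, s1), conv_output_size(n, k, s2))  # 4 3
```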

Padding

For an ordered pair $(p_{1}, p_{2})$, the following function, or the ordered pair itself, is called padding.

$$
\begin{align*}
\operatorname{padding} : M^{n \times m}(\mathbb{R}) &\to M^{(n+2p_{1}) \times (m+2p_{2})}(\mathbb{R}) \\
\mathbf{X} &\mapsto \begin{bmatrix} O_{p_{1} \times p_{2}} & O_{p_{1} \times m} & O_{p_{1} \times p_{2}} \\ O_{n \times p_{2}} & \mathbf{X} & O_{n \times p_{2}} \\ O_{p_{1} \times p_{2}} & O_{p_{1} \times m} & O_{p_{1} \times p_{2}} \end{bmatrix}
\end{align*}
$$

The above is a block matrix, and $O$ denotes a zero matrix. Simply put, padding adds values around the edges of the matrix: top, bottom, left, and right. Padding is used because the codomain $M^{(n-k+1) \times (n-k+1)}$ of a convolutional layer is smaller than the domain $M^{n \times n}$, so if an image is fed through convolutional layers repeatedly, it gradually shrinks. Padding prevents this. Adding padding to a convolutional layer means applying padding between the input $\mathbf{X}$ and the convolution $C_{\mathbf{W}}$.

$$
C_{\mathbf{W}} \circ \operatorname{padding} (\mathbf{X})
$$

Thus, by enlarging $\mathbf{X}$ in advance, the original size is maintained even after it shrinks by passing through $C_{\mathbf{W}}$. If $k$ is odd, padding with $p = (k-1)/2$ preserves the size of the input matrix.

Above, padding was defined as adding $0$s to the matrix's top, bottom, left, and right, but the padding value does not have to be $0$. PyTorch implements various padding methods, sketched below. Padding with $0$ in particular is called zero padding.
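For instance, `torch.nn.functional.pad` exposes several of these modes. The sketch below assumes a single $3 \times 3$ input (with batch and channel dimensions added, as PyTorch expects) and $p_{1} = p_{2} = 1$.

```python
import torch
import torch.nn.functional as F

x = torch.arange(9.0).reshape(1, 1, 3, 3)   # (batch, channel, 3, 3)

# pad = (left, right, top, bottom), applied to the last two dimensions
zero = F.pad(x, (1, 1, 1, 1), mode="constant", value=0)  # zero padding
refl = F.pad(x, (1, 1, 1, 1), mode="reflect")            # mirror the border values
repl = F.pad(x, (1, 1, 1, 1), mode="replicate")          # repeat the edge values
print(zero.shape)                           # torch.Size([1, 1, 5, 5])
```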

If the kernel size is $k_{1} \times k_{2}$, the input matrix size is $n_{1} \times n_{2}$, the stride is $(s_{1}, s_{2})$, and the padding is $(p_{1}, p_{2})$, then the size of the output matrix of the convolutional layer $C_{\mathbf{W}} \circ \operatorname{padding}$ is as follows.

$$
\left( \left\lfloor \frac{(n_{1} + 2p_{1}) - k_{1}}{s_{1}} \right\rfloor + 1 \right) \times \left( \left\lfloor \frac{(n_{2} + 2p_{2}) - k_{2}}{s_{2}} \right\rfloor + 1 \right)
$$
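As a quick sanity check of this formula, here is a sketch with assumed values $n = 7$, $k = 3$, stride $2$, and padding $1$, so the expected size is $\lfloor (7 + 2 - 3)/2 \rfloor + 1 = 4$.

```python
import torch

# Assumed configuration: n = 7, k = 3, stride = 2, padding = 1
conv = torch.nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 1, 7, 7)
print(conv(x).shape)                        # torch.Size([1, 1, 4, 4])
```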

Channel

The kernel may also be a tensor instead of a matrix. When $\mathbf{W}$ is a tensor of size $k \times k \times c$, $c$ is referred to as the channel of $\mathbf{W}$. If the input matrix has size $n \times n$ and the kernel has size $k \times k \times c$, then $\mathbf{Y} = C_{\mathbf{W}}(\mathbf{X})$ has size $(n-k+1) \times (n-k+1) \times c$. The output is calculated as follows.

$$
Y_{ij\ell} = \sum_{q=1}^{k} \sum_{r=1}^{k} W_{qr\ell}\, X_{i+q-1, j+r-1}
$$

Here, when the stride is $(s_{1}, s_{2})$, the ranges of $i$, $j$, and $\ell$ are

$$
i = 1, 2, \dots, \left\lfloor \frac{n-k}{s_{1}} \right\rfloor + 1, \quad j = 1, 2, \dots, \left\lfloor \frac{n-k}{s_{2}} \right\rfloor + 1, \quad \ell = 1, \dots, c
$$
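In PyTorch this corresponds to `out_channels`; note that PyTorch stores the channel axis first, as $(c, n, n)$, rather than last as in the $k \times k \times c$ convention above. A minimal sketch with assumed sizes:

```python
import torch

n, k, c = 8, 3, 4                           # assumed sizes
conv = torch.nn.Conv2d(in_channels=1, out_channels=c, kernel_size=k)
x = torch.randn(1, 1, n, n)                 # single-channel n x n input
y = conv(x)
print(y.shape)   # torch.Size([1, 4, 6, 6]) = (batch, c, n-k+1, n-k+1)
```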