
Stochastic Gradient Descent

Definition

The method known as Stochastic Gradient Descent refers to the following update rule, given an objective function $Q$, a learning rate $\alpha > 0$, a batch size $m$, and $Q_{i}$ denoting the objective evaluated on the $i$th data point: $$ \omega_{n+1} := \omega_{n} - \alpha {{ 1 } \over { m }} \sum_{i=1}^{m} \nabla Q_{i} ( \omega_{n} ) $$
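As a minimal sketch of this update rule (the function name `sgd_step`, the use of NumPy, and the representation of the batch as a list of per-sample gradients are illustrative assumptions, not part of the definition), a single step might look like this:

```python
import numpy as np

def sgd_step(w, per_sample_grads, alpha):
    """One stochastic gradient descent update:
    w <- w - alpha * (1/m) * sum_i grad Q_i(w),
    where per_sample_grads holds grad Q_i(w) for each of the m samples in the batch."""
    m = len(per_sample_grads)
    return w - alpha * np.sum(per_sample_grads, axis=0) / m
```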

Explanation

Machine Learning

Since Stochastic Gradient Descent deals with data, it is inevitably closely tied to machine learning. Even if some of the terminology is unfamiliar, it is worth working through an example:

Suppose we face a problem where, given $x_{1}$ and $x_{2}$, the task is to predict $y$. The model under consideration is the linear combination $y = ax_{1} + bx_{2}$. Imagine our dataset is enormous, containing about 30 trillion observations, while the computer that has to process it is relatively underpowered, so we decide to work through the problem piece by piece. $$ \begin{matrix} y & x_{1} & x_{2} \\ 5.2 & 2.0 & 0.9 \\ 2.1 & 1.2 & 0.0 \\ 9.0 & 3.8 & 1.3 \\ 6.7 & 3.7 & -0.5 \end{matrix} $$ For example, if we process the data in chunks of $4$ observations at a time, such as the chunk shown above, then the batch size is $m=4$. This is only an example, and the problem itself is fairly simple: one can see just by looking that $a \approx 2, b \approx 1$ is roughly the right answer. It is a typical regression problem, so it could be solved directly with the least squares method, but to aid understanding let us think about attacking it with an optimization technique instead.
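As a rough sketch of this chunking, assuming a small synthetic NumPy dataset standing in for the real one (the generated data, the helper name `iter_batches`, and the random seed are all illustrative choices, not from the original):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a dataset too large to process in one go:
# each observation is (x1, x2), with y generated near 2*x1 + 1*x2.
n = 100_000
X = rng.uniform(-2.0, 4.0, size=(n, 2))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0.0, 0.2, size=n)

m = 4  # batch size, as in the example above

def iter_batches(X, y, m, rng):
    """Shuffle the indices and yield the data in chunks of m observations."""
    idx = rng.permutation(len(y))
    for start in range(0, len(y), m):
        chunk = idx[start:start + m]
        yield X[chunk], y[chunk]
```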

An optimization technique begins with an objective function. Here the objective function $Q$ can be defined as follows, and our goal is to find the $\omega = (a, b)$ that minimizes it. $$ Q (a,b) := \left( ax_{1} + bx_{2} - y \right)^2 $$ Let us keep in mind that the variables of interest are $a,b$. Finding $a,b$ from the given batch means considering four functions at once, one for each of the $i=1,\cdots , m$ data points: $$ Q_{1} (a,b) := \left( 2.0a + 0.9b - 5.2 \right)^2 \\ Q_{2} (a,b) := \left( 1.2a + 0.0b - 2.1 \right)^2 \\ Q_{3} (a,b) := \left( 3.8a + 1.3b - 9.0 \right)^2 \\ Q_{4} (a,b) := \left( 3.7a - 0.5b - 6.7 \right)^2 $$ Here the motivation behind the name ‘stochastic’ gradient descent becomes clear. Even though no explicitly probabilistic concept appears in the formulas themselves, it is called stochastic gradient descent because the batch of data used at each step is selected at random. Updating with the average of the gradients over that batch is, of course, a perfectly sensible thing to do.
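Putting the pieces together for the batch above, a sketch of the averaged-gradient update might look as follows. The gradient formula $\nabla Q_{i} = 2 \left( ax_{1} + bx_{2} - y \right) (x_{1}, x_{2})$ follows from the definition of $Q_{i}$; the learning rate, iteration count, and initial guess are arbitrary illustrative choices.

```python
import numpy as np

# The four observations (y, x1, x2) from the table above.
data = np.array([
    [5.2, 2.0, 0.9],
    [2.1, 1.2, 0.0],
    [9.0, 3.8, 1.3],
    [6.7, 3.7, -0.5],
])

def grad_Q_i(w, row):
    """Gradient of Q_i(a, b) = (a*x1 + b*x2 - y)^2 with respect to w = (a, b)."""
    y, x1, x2 = row
    residual = w[0] * x1 + w[1] * x2 - y
    return 2.0 * residual * np.array([x1, x2])

w = np.zeros(2)   # initial guess (a, b) = (0, 0)
alpha = 0.01      # learning rate (illustrative choice)

for _ in range(500):
    # Average the per-sample gradients over the batch of m = 4 and step downhill.
    avg_grad = np.mean([grad_Q_i(w, row) for row in data], axis=0)
    w = w - alpha * avg_grad

print(w)  # settles near the least-squares fit of this tiny batch, roughly (1.96, 1.25),
          # consistent with the eyeballed a ≈ 2, b ≈ 1
```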

Gradient Descent?

Stochastic Gradient Descent is mathematically no different from the usual Gradient Descent. The essential difference is the presence or absence of data, and it is important to remember that the optimization techniques that grew out of stochastic gradient descent are likewise concerned with data.