An optimization problem refers to finding $x^*$ such that the value of a function $f : \mathbb{R}^n \to \mathbb{R}$ is minimized.
$$x^* = \underset{x}{\operatorname{argmin}}\, f(x)$$
An algorithm (or family of algorithms) used to solve an optimization problem is called an optimizer.
Explanation
In machine learning and deep learning, the function $f$ is called the loss function, and in this context $x$ corresponds to the parameters of the neural network, i.e., its weights.
Stochastic Gradient Descent
The optimizers used in deep learning are almost all variants of stochastic gradient descent. Let us denote the loss function by $L$, the parameters by $\theta$, and the learning rate by $\alpha > 0$. Gradient descent refers to the following type of optimizer; the "stochastic" version estimates the gradient on a random mini-batch of the training data rather than on the full dataset:
$$\theta_{i+1} = \theta_i - \alpha \nabla L(\theta_i)$$
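As a minimal sketch (not from the original text), the update can be written in a few lines of Python; the quadratic loss, the `grad` function, and the learning rate of 0.1 are illustrative choices.

```python
import numpy as np

# Toy quadratic loss L(theta) = ||theta - target||^2; `target`, `lr`,
# and `grad` are illustrative names, not from the original text.
target = np.array([3.0, -1.0])

def grad(theta):
    # Gradient of ||theta - target||^2
    return 2.0 * (theta - target)

theta = np.zeros(2)   # initial parameters theta_0
lr = 0.1              # learning rate alpha

for _ in range(100):
    # theta_{i+1} = theta_i - alpha * grad L(theta_i)
    theta = theta - lr * grad(theta)

print(theta)  # approaches [3.0, -1.0]
```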
Momentum Technique
The momentum technique is an optimizer that accumulates the gradients from previous steps, discounted by a factor $0 \le \beta < 1$, as follows:
$$\theta_{i+1} = \theta_i - \alpha \sum_{j=0}^{i} \beta^{j}\, \nabla L(\theta_{i-j})$$
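Below is a small Python sketch on an assumed toy quadratic loss. It uses the usual recursive form with a velocity vector, whose unrolling reproduces the weighted sum of past gradients above; the names `velocity`, `lr`, and `beta` and their values are illustrative.

```python
import numpy as np

target = np.array([3.0, -1.0])

def grad(theta):
    # Gradient of the toy loss ||theta - target||^2
    return 2.0 * (theta - target)

theta = np.zeros(2)
velocity = np.zeros(2)   # exponentially weighted sum of past gradients
lr, beta = 0.1, 0.9

for _ in range(200):
    # Unrolling velocity = beta * velocity + grad(...) gives the
    # weighted sum over past gradients in the formula above.
    velocity = beta * velocity + grad(theta)
    theta = theta - lr * velocity

print(theta)  # approaches [3.0, -1.0]
```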
Nesterov Momentum
Nesterov momentum is a slight modification of the momentum technique. For $p_0 = 0$,
$$p_{i+1} = \beta p_i - \alpha \nabla L(\theta_i + \beta p_i)$$
$$\theta_{i+1} = \theta_i + p_{i+1}$$
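A minimal Python sketch of these two update rules, assuming the same toy quadratic loss as above; evaluating the gradient at the look-ahead point $\theta_i + \beta p_i$ is the only difference from plain momentum.

```python
import numpy as np

target = np.array([3.0, -1.0])

def grad(theta):
    # Gradient of the toy loss ||theta - target||^2
    return 2.0 * (theta - target)

theta = np.zeros(2)
p = np.zeros(2)          # p_0 = 0
lr, beta = 0.1, 0.9

for _ in range(200):
    lookahead = theta + beta * p          # gradient is taken at theta_i + beta * p_i
    p = beta * p - lr * grad(lookahead)   # p_{i+1}
    theta = theta + p                     # theta_{i+1}

print(theta)  # approaches [3.0, -1.0]
```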
AdaGrad
AdaGrad is an optimizer that applies an adaptive learning rate to each coordinate of the parameters. If we denote the gradient simply as $\nabla L_i = \nabla L(\theta_i)$, the update is
$$\theta_{i+1} = \theta_i - \frac{\alpha}{\sqrt{\sum_{j=0}^{i} \nabla L_j \odot \nabla L_j} + \epsilon} \odot \nabla L_i$$
where $\odot$, the square root, and the division are taken elementwise, and $\epsilon$ is a small constant that prevents division by zero.
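A minimal Python sketch of this update on an assumed toy quadratic loss; `accum` holds the running sum of elementwise squared gradients, and the values of `lr` and `eps` are illustrative.

```python
import numpy as np

target = np.array([3.0, -1.0])

def grad(theta):
    # Gradient of the toy loss ||theta - target||^2
    return 2.0 * (theta - target)

theta = np.zeros(2)
accum = np.zeros(2)      # running sum of elementwise squared gradients
lr, eps = 0.5, 1e-8

for _ in range(500):
    g = grad(theta)
    accum += g * g                                   # accumulate squared gradients
    theta = theta - lr * g / (np.sqrt(accum) + eps)  # per-coordinate adaptive step

print(theta)  # approaches [3.0, -1.0]
```

Coordinates that keep seeing large gradients accumulate a large `accum` and therefore take smaller steps, which is what "adaptive learning rate" means here.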
Graduate Student Descent
Graduate student descent refers to using graduate students as the optimizer. It has been in wide use since ancient times and is still actively used all over the world. Its performance varies, but in terms of cost it is known to deliver remarkably cost-effective optimization.
Monte Carlo
Monte Carlo optimization simply evaluates the objective at as many randomly sampled points as possible and keeps the best one found.
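A minimal random-search sketch in Python, assuming a toy quadratic objective and a search box of $[-5, 5]^2$; both are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([3.0, -1.0])

def f(x):
    # Toy objective for the sketch: a shifted quadratic
    return np.sum((x - target) ** 2)

# Sample many random candidates in the assumed search box and keep the best.
best_x, best_val = None, np.inf
for _ in range(10_000):
    candidate = rng.uniform(-5.0, 5.0, size=2)
    val = f(candidate)
    if val < best_val:
        best_x, best_val = candidate, val

print(best_x, best_val)  # best_x lands near [3.0, -1.0]
```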
Grid Search
Grid search, as its name suggests, divides the Euclidean space $\mathbb{R}^n$ (in practice, a bounded search region) into a grid and evaluates the objective at each grid point to find the best one.
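A minimal grid-search sketch in Python over the same assumed toy objective and search box; the grid resolution of 101 points per axis is an arbitrary illustrative choice.

```python
import itertools
import numpy as np

target = np.array([3.0, -1.0])

def f(x):
    # Toy objective for the sketch: a shifted quadratic
    return np.sum((x - target) ** 2)

# Discretize each axis of the search box and evaluate f at every grid point.
axis = np.linspace(-5.0, 5.0, 101)            # 101 points per dimension, spacing 0.1
best_x, best_val = None, np.inf
for point in itertools.product(axis, axis):   # all 101**2 grid points in R^2
    val = f(np.array(point))
    if val < best_val:
        best_x, best_val = np.array(point), val

print(best_x, best_val)  # accuracy is limited by the grid spacing
```

The number of grid points grows exponentially with the dimension $n$, which is why grid search is usually only practical for low-dimensional problems.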