Implementing MLP in Julia Flux and Optimizing with Gradient Descent
MLP Implementation
First, let’s load Flux.jl, Julia’s machine learning package, along with the parameter update method update!.
using Flux
using Flux: update!
The Dense() function creates a fully connected (linear) layer, optionally with an activation function. The Chain() function stacks these layers, similar to the Sequential() feature in Keras and PyTorch.
julia> model = Chain(
           Dense(10, 5, relu),
           Dense(5, 5, relu),
           Dense(5, 2)
       )
Chain(
  Dense(10, 5, relu),                   # 55 parameters
  Dense(5, 5, relu),                    # 30 parameters
  Dense(5, 2),                          # 12 parameters
)                   # Total: 6 arrays, 97 parameters, 772 bytes.
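The per-layer parameter counts in the output follow directly from each Dense(in, out) layer holding an in×out weight matrix plus an out-element bias vector:

```julia
# Dense(in, out) has in*out weights plus out biases:
10*5 + 5  # Dense(10, 5, relu): 55 parameters
5*5 + 5   # Dense(5, 5, relu):  30 parameters
5*2 + 2   # Dense(5, 2):        12 parameters (97 in total)
```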
Now let’s create two arbitrary vectors.
x, y = randn(Float32,10), randn(Float32,2) # Dummy data
What we want is for the model to take input $\mathbf{x}$ and produce output $\mathbf{y}$. In other words, the following equation should hold:
$$ \begin{equation} \mathbf{y} = \text{model}(\mathbf{x}) \end{equation} $$
Currently, the model’s weights are initialized independently of this, so naturally, it can’t take $\mathbf{x}$ as input and produce $\mathbf{y}$ as output. Let’s define a loss function to check.
julia> loss(x,y) = sum((model(x) .- y).^2)
loss (generic function with 1 method)
julia> loss(x,y)
2.1870441f0
Optimization through Gradient Descent
Now, we will modify the model’s parameters through gradient descent so that $(1)$ holds true. Let’s set the model’s parameters as $\theta$ and the learning rate as $\eta$.
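In other words, each step of gradient descent applies the following update, which is exactly what the loop below implements:

$$ \theta \leftarrow \theta - \eta \nabla_{\theta}\, \text{loss}(\mathbf{x}, \mathbf{y}) $$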
julia> θ = Flux.params(model) #Parameters of model
Params([Float32[-0.29360774 0.19699441 … 0.14506716 0.0025551221; -0.49602875 -0.16636823 … 0.5430107 -0.6045276; … ; -0.29023308 -0.092979304 … -0.32037085 0.5427146; -0.2713689 0.17020355 … 0.31819585 -0.15343323], Float32[0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.7170611 -0.029388033 … -0.74007404 -0.6452301; 0.4532911 -0.020822287 … 0.13228391 -0.2621395; … ; -0.16696058 -0.3569812 … 0.50665516 0.68468684; 0.19336864 -0.7220591 … 0.66947246 0.5213879], Float32[0.0, 0.0, 0.0, 0.0, 0.0], Float32[-0.57012194 -0.12291523 … -0.22337069 -0.54895186; 0.45517293 0.5325932 … -0.16550031 -0.15918007], Float32[0.0, 0.0]])
julia> η = 0.1 # Learning Rate
0.1
We can calculate gradients with gradient() and update parameters with the update! method. Optimizing until the loss falls below $10^{-7}$ looks like this.
julia> i=1
1
julia> @time while loss(x,y) > 0.0000001
           grads = gradient(() -> loss(x, y), θ)
           for param in θ
               update!(param, η * grads[param])
           end
           print("epoch_", i, " loss=", loss(x,y), "\n")
           i += 1
       end
epoch_1 loss=0.2048448
epoch_2 loss=0.5493529
⋮
epoch_90 loss=1.0927481e-7
epoch_91 loss=9.224256e-8
0.136134 seconds (150.85 k allocations: 8.681 MiB, 66.42% compilation time)
The following code does the same thing as above.
opt = Descent(0.1) # Gradient descent with learning rate 0.1
i = 1
@time while loss(x,y) > 0.0000001
    grads = gradient(() -> loss(x, y), θ)
    for param in θ
        update!(opt, param, grads[param])
    end
    print("epoch_", i, " loss=", loss(x,y), "\n")
    i += 1
end
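For reference, Flux also ships a higher-level helper, Flux.train!, that wraps this same gradient/update! pattern. A sketch of the equivalent optimization with it, reusing the loss, θ, and Descent optimizer from above (the epoch cap and stopping check are assumptions added for parity with the loops above):

```julia
opt = Descent(0.1)       # same gradient descent optimizer as above
data = [(x, y)]          # one (input, label) pair per call
for epoch in 1:200
    Flux.train!(loss, θ, data, opt)  # one gradient step over `data`
    loss(x, y) < 1e-7 && break       # same stopping criterion as above
end
```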
Indeed, comparing the values of $\mathbf{y}$ and $\text{model}(\mathbf{x})$, there is hardly any difference.
julia> y
2-element Vector{Float32}:
0.8913109
-1.4473413
julia> model(x)
2-element Vector{Float32}:
0.891474
-1.4475975
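One way to check this agreement programmatically is with isapprox (the tolerance here is chosen for illustration; a loss below $10^{-7}$ bounds the norm of the difference by about $3 \times 10^{-4}$):

```julia
# Norm of the difference is below the chosen tolerance after training
isapprox(model(x), y; atol=1e-3)
```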
Full Code
# Load packages
using Flux
using Flux: update!

# Define model as MLP
model = Chain(
    Dense(10, 5, relu),
    Dense(5, 5, relu),
    Dense(5, 2)
)

# Create dummy data and label
x, y = randn(Float32, 10), randn(Float32, 2)

# Define loss function as the sum of squared errors
loss(x, y) = sum((model(x) .- y).^2)

# Get parameters of model and set learning rate
θ = Flux.params(model) # Parameters of model
η = 0.1

# Learning using gradient descent
i = 1
@time while loss(x, y) > 0.0000001
    grads = gradient(() -> loss(x, y), θ)
    for param in θ
        update!(param, η * grads[param])
    end
    print("epoch_", i, " loss=", loss(x, y), "\n")
    i += 1
end
Environment
- OS: Windows 10
- Version: Julia 1.6.2, Flux 0.12.8
