Implementing an MLP in Julia's Flux and Optimizing with Gradient Descent
MLP Implementation
First, let's load Flux.jl, the machine learning package for Julia, along with the optimizer update method `update!`.
using Flux
using Flux: update!
We can use the `Dense()` function for linear layers. The `Chain()` function stacks these linear layers, similar to the `Sequential()` feature in Keras and PyTorch.
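As a quick sketch of what a single layer does under the hood: a `Dense(in, out, σ)` layer stores a weight matrix of size `out × in` and a bias vector of length `out`, and computes `σ.(weight * x .+ bias)`. The field names below assume Flux 0.12 (the version used here); older releases expose them as `W` and `b`.

```julia
using Flux

layer = Dense(10, 5, relu)   # layer.weight is 5×10, layer.bias has length 5
x = randn(Float32, 10)

# A Dense layer is an affine map followed by the (broadcast) activation:
layer(x) ≈ relu.(layer.weight * x .+ layer.bias)   # true
```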
julia> model = Chain(
           Dense(10, 5, relu),
           Dense(5, 5, relu),
           Dense(5, 2)
       )
Chain(
  Dense(10, 5, relu),                   # 55 parameters
  Dense(5, 5, relu),                    # 30 parameters
  Dense(5, 2),                          # 12 parameters
)                   # Total: 6 arrays, 97 parameters, 772 bytes.
Now let’s create two arbitrary vectors.
x, y = randn(Float32,10), randn(Float32,2) # Dummy data
What we want is for the model to take input $\mathbf{x}$ and produce output $\mathbf{y}$. In other words, the following equation should hold:
$$ \begin{equation} \mathbf{y} = \text{model}(\mathbf{x}) \end{equation} $$
Currently, the model’s weights are initialized independently of this, so naturally, it can’t take $\mathbf{x}$ as input and produce $\mathbf{y}$ as output. Let’s define a loss function to check.
julia> loss(x,y) = sum((model(x) .- y).^2)
loss (generic function with 1 method)
julia> loss(x,y)
2.1870441f0
Optimization through Gradient Descent
Now, we will modify the model's parameters through gradient descent so that $(1)$ holds true. Let's denote the model's parameters by $\theta$ and the learning rate by $\eta$.
julia> θ = Flux.params(model) #Parameters of model
Params([Float32[-0.29360774 0.19699441 … 0.14506716 0.0025551221; -0.49602875 -0.16636823 … 0.5430107 -0.6045276; … ; -0.29023308 -0.092979304 … -0.32037085 0.5427146; -0.2713689 0.17020355 … 0.31819585 -0.15343323], Float32[0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.7170611 -0.029388033 … -0.74007404 -0.6452301; 0.4532911 -0.020822287 … 0.13228391 -0.2621395; … ; -0.16696058 -0.3569812 … 0.50665516 0.68468684; 0.19336864 -0.7220591 … 0.66947246 0.5213879], Float32[0.0, 0.0, 0.0, 0.0, 0.0], Float32[-0.57012194 -0.12291523 … -0.22337069 -0.54895186; 0.45517293 0.5325932 … -0.16550031 -0.15918007], Float32[0.0, 0.0]])
julia> η = 0.1 # Learning Rate
0.1
We can calculate gradients with `gradient()` and modify parameters with the `update!` method. If we optimize until the loss is smaller than $10^{-7}$, it looks like this.
julia> i=1
1
julia> @time while loss(x,y) > 0.0000001
           grads = gradient(() -> loss(x, y), θ)
           for param in θ
               update!(param, η * grads[param])
           end
           print("epoch_", i, " loss=", loss(x,y), "\n")
           i += 1
       end
epoch_1 loss=0.2048448
epoch_2 loss=0.5493529
...
epoch_90 loss=1.0927481e-7
epoch_91 loss=9.224256e-8
0.136134 seconds (150.85 k allocations: 8.681 MiB, 66.42% compilation time)
The following code does the same thing as above.
opt = Descent(0.1) # Gradient descent with learning rate 0.1
i = 1
@time while loss(x,y) > 0.0000001
    grads = gradient(() -> loss(x, y), θ)
    for param in θ
        update!(opt, param, grads[param])
    end
    print("epoch_", i, " loss=", loss(x,y), "\n")
    i += 1
end
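Flux also bundles the gradient-and-update step into `Flux.train!`, which runs one pass over a data iterator. As a minimal sketch of the same optimization using this helper (the single-pair "dataset" `[(x, y)]` is just our dummy data from above):

```julia
opt = Descent(0.1)       # gradient descent with learning rate 0.1
θ = Flux.params(model)

# train! computes gradients of loss w.r.t. θ and applies opt
# once per (x, y) tuple in the data iterator.
while loss(x, y) > 0.0000001
    Flux.train!(loss, θ, [(x, y)], opt)
end
```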
Indeed, comparing the values of $\mathbf{y}$ and $\text{model}(\mathbf{x})$, there is hardly any difference.
julia> y
2-element Vector{Float32}:
0.8913109
-1.4473413
julia> model(x)
2-element Vector{Float32}:
0.891474
-1.4475975
Full Code
# Load packages
using Flux
using Flux: update!

# Define model as MLP
model = Chain(
    Dense(10, 5, relu),
    Dense(5, 5, relu),
    Dense(5, 2)
)

# Create dummy data and label
x, y = randn(Float32, 10), randn(Float32, 2)

# Define loss function as the sum of squared errors
loss(x, y) = sum((model(x) .- y).^2)

# Get parameters of model and set learning rate
θ = Flux.params(model) # Parameters of model
η = 0.1

# Learning using gradient descent
i = 1
@time while loss(x, y) > 0.0000001
    grads = gradient(() -> loss(x, y), θ)
    for param in θ
        update!(param, η * grads[param])
    end
    print("epoch_", i, " loss=", loss(x, y), "\n")
    global i += 1   # `global` needed when run as a script (soft scope)
end
Environment
- OS: Windows10
- Version: Julia 1.6.2, Flux 0.12.8