Implementing an MLP in Julia's Flux and Optimizing with Gradient Descent
MLP Implementation
First, let's load Flux.jl, the machine learning package for Julia, along with the optimizer update method `update!`.
using Flux
using Flux: update!
We can use the `Dense()` function for linear layers. The `Chain()` function stacks these linear layers, similar to the `Sequential()` feature in Keras and PyTorch.
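As a quick sketch of what a single layer does under the hood: a `Dense(in, out, σ)` layer stores a weight matrix of size `out × in` and a bias vector of length `out`, and computes `σ.(weight * x .+ bias)`. The field names below assume Flux 0.12 (the version used here); older releases expose them as `W` and `b`.

```julia
using Flux

layer = Dense(10, 5, relu)   # layer.weight is 5×10, layer.bias has length 5
x = randn(Float32, 10)

# A Dense layer is an affine map followed by the (broadcast) activation:
layer(x) ≈ relu.(layer.weight * x .+ layer.bias)   # true
```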
julia> model = Chain(
           Dense(10, 5, relu),
           Dense(5, 5, relu),
           Dense(5, 2)
       )
Chain(
  Dense(10, 5, relu),                   # 55 parameters
  Dense(5, 5, relu),                    # 30 parameters
  Dense(5, 2),                          # 12 parameters
)                   # Total: 6 arrays, 97 parameters, 772 bytes.
Now let’s create two arbitrary vectors.
x, y = randn(Float32,10), randn(Float32,2) # Dummy data
What we want is for the model to take input $\mathbf{x}$ and produce output $\mathbf{y}$. In other words, the following equation should hold:
$$ \begin{equation} \mathbf{y} = \text{model}(\mathbf{x}) \end{equation} $$
Currently, the model’s weights are initialized independently of this, so naturally, it can’t take $\mathbf{x}$ as input and produce $\mathbf{y}$ as output. Let’s define a loss function to check.
julia> loss(x,y) = sum((model(x) .- y).^2)
loss (generic function with 1 method)
julia> loss(x,y)
2.1870441f0
Optimization through Gradient Descent
Now, we will modify the model's parameters through gradient descent so that $(1)$ holds true. Let's denote the model's parameters by $\theta$ and the learning rate by $\eta$.
julia> θ = Flux.params(model) #Parameters of model
Params([Float32[-0.29360774 0.19699441 … 0.14506716 0.0025551221; -0.49602875 -0.16636823 … 0.5430107 -0.6045276; … ; -0.29023308 -0.092979304 … -0.32037085 0.5427146; -0.2713689 0.17020355 … 0.31819585 -0.15343323], Float32[0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.7170611 -0.029388033 … -0.74007404 -0.6452301; 0.4532911 -0.020822287 … 0.13228391 -0.2621395; … ; -0.16696058 -0.3569812 … 0.50665516 0.68468684; 0.19336864 -0.7220591 … 0.66947246 0.5213879], Float32[0.0, 0.0, 0.0, 0.0, 0.0], Float32[-0.57012194 -0.12291523 … -0.22337069 -0.54895186; 0.45517293 0.5325932 … -0.16550031 -0.15918007], Float32[0.0, 0.0]])
julia> η = 0.1 # Learning Rate
0.1
We can calculate gradients with `gradient()` and modify parameters with the `update!` method. If we optimize until the loss is smaller than $10^{-7}$, it looks like this.
julia> i=1
1
julia> @time while loss(x,y) > 0.0000001
           grads = gradient(() -> loss(x, y), θ)
           for param in θ
               update!(param, η * grads[param])
           end
           print("epoch_", i, " loss=", loss(x,y), "\n")
           i += 1
       end
epoch_1 loss=0.2048448
epoch_2 loss=0.5493529
...
epoch_90 loss=1.0927481e-7
epoch_91 loss=9.224256e-8
0.136134 seconds (150.85 k allocations: 8.681 MiB, 66.42% compilation time)
The following code does the same thing as above.
opt = Descent(0.1) # Gradient descent with learning rate 0.1
i = 1
@time while loss(x,y) > 0.0000001
    grads = gradient(() -> loss(x, y), θ)
    for param in θ
        update!(opt, param, grads[param])
    end
    print("epoch_", i, " loss=", loss(x,y), "\n")
    i += 1
end
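Flux also bundles the gradient-and-update step into `Flux.train!`, which runs one pass over a data iterator. As a minimal sketch of the same optimization using this helper (the single-pair "dataset" `[(x, y)]` is just our dummy data from above):

```julia
opt = Descent(0.1)       # gradient descent with learning rate 0.1
θ = Flux.params(model)

# train! computes gradients of loss w.r.t. θ and applies opt
# once per (x, y) tuple in the data iterator.
while loss(x, y) > 0.0000001
    Flux.train!(loss, θ, [(x, y)], opt)
end
```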
Indeed, comparing the values of $\mathbf{y}$ and $\text{model}(\mathbf{x})$, there is hardly any difference.
julia> y
2-element Vector{Float32}:
0.8913109
-1.4473413
julia> model(x)
2-element Vector{Float32}:
0.891474
-1.4475975
Full Code
# Load packages
using Flux
using Flux: update!

# Define model as MLP
model = Chain(
    Dense(10, 5, relu),
    Dense(5, 5, relu),
    Dense(5, 2)
)

# Create dummy data and label
x, y = randn(Float32, 10), randn(Float32, 2)

# Define loss function as the sum of squared errors
loss(x, y) = sum((model(x) .- y).^2)

# Get parameters of model and set learning rate
θ = Flux.params(model) # Parameters of model
η = 0.1

# Learning using gradient descent
i = 1
@time while loss(x, y) > 0.0000001
    grads = gradient(() -> loss(x, y), θ)
    for param in θ
        update!(param, η * grads[param])
    end
    print("epoch_", i, " loss=", loss(x, y), "\n")
    global i += 1   # `global` needed when run as a script (soft scope)
end
Environment
- OS: Windows10
- Version: Julia 1.6.2, Flux 0.12.8