How to Use the GPU in Julia Flux
Overview
In this article, we introduce how to implement deep learning with Flux.jl, a machine learning library for Julia, and how to accelerate training with the GPU.
To use the GPU, CUDA.jl is required and the CUDA settings must be configured properly in advance. The setup for CUDA is similar to that in Python, so refer to a separate post on CUDA setup for guidance.
Deep learning on GPUs is fundamental, yet it is also a significant barrier for beginners, and from that perspective the official Julia tutorials can feel somewhat unfriendly. The process of enabling GPU acceleration in Flux, however, is quite intuitive and straightforward.
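For completeness, a minimal sketch of installing the packages used below; recent versions of CUDA.jl download a compatible CUDA toolkit on first use, so only a working NVIDIA driver needs to be present beforehand:

using Pkg
Pkg.add(["Flux", "CUDA", "BenchmarkTools"])   # CUDA.jl fetches its toolkit artifacts on first use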
Code
julia> using Flux, BenchmarkTools
optimizer = Adam()
f(v) = v[1].^2 + 2v[2] - v[3]*v[4]
n = 10_000
function ftry_FFNN()
    FFNN = Chain(
        Dense(  4, 100, relu),
        Dense(100, 100, relu),
        Dense(100, 100, relu),
        Dense(100,   1),
    )
    return FFNN
end
ftry_FFNN (generic function with 1 method)
This post uses an MLP (a feed-forward neural network) as the example, trained to approximate a non-linear function. The focus here is not on the performance of this particular network but on how to use the GPU to train it. The function ftry_FFNN() is named after the idea that it returns an FFNN, like a factory.
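As a quick sanity check (not part of the original transcript), the factory can be called and applied to a dummy batch; since the last layer is Dense(100, 1), a 4×k input is mapped to a 1×k output:

FFNN = ftry_FFNN()
size(FFNN(rand(Float32, 4, 5)))   # (1, 5): one prediction per column, i.e. per sample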
julia> X = rand(Float32, 4, n)
4×10000 Matrix{Float32}:
0.669836 0.260559 0.710337 0.5121 … 0.298401 0.763405 0.977941 0.89907
0.135182 0.938298 0.110935 0.534417 0.804302 0.197353 0.784419 0.179469
0.287801 0.839834 0.718759 0.15976 0.442106 0.696369 0.41352 0.617924
0.213884 0.345568 0.426718 0.961369 0.542308 0.378965 0.0633196 0.580489
julia> Y = f.(eachcol(X))'
1×10000 adjoint(::Vector{Float32}) with eltype Float32:
0.657488 1.65427 0.419741 1.17749 … 1.45789 0.713595 2.49902 0.808568
Data is generated randomly. Note that Flux treats each column as one observation, so the input X is a 4×n matrix and the target Y is a 1×n row.
CPU
julia> data = Flux.DataLoader((X, Y), shuffle = true, batchsize = 1000);
julia> cFFNN = ftry_FFNN()
Chain(
  Dense(4 => 100, relu),                # 500 parameters
  Dense(100 => 100, relu),              # 10_100 parameters
  Dense(100 => 100, relu),              # 10_100 parameters
  Dense(100 => 1),                      # 101 parameters
)                   # Total: 8 arrays, 20_801 parameters, 81.754 KiB.
julia> Loss(x,y) = Flux.mse(cFFNN(x), y)
Loss (generic function with 1 method)
julia> ps = Flux.params(cFFNN);
julia> @btime Flux.train!(Loss, ps, data, optimizer)
22.027 ms (3413 allocations: 60.07 MiB)
While the training runs, one can observe CPU utilization increase.
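Note that @btime measures a single call to Flux.train!, i.e. one pass over the DataLoader. A full training run wraps this call in an epoch loop; a minimal sketch using the objects defined above (the epoch count is arbitrary):

for epoch in 1:100
    Flux.train!(Loss, ps, data, optimizer)              # one pass over all batches
    epoch % 10 == 0 && println("epoch $epoch: loss = ", Loss(X, Y))
end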
GPU
julia> using CUDA
julia> CUDA.functional()
true
First, load CUDA.jl and verify that CUDA works in the current environment by calling CUDA.functional(). If it returns true, CUDA is operational; if not, the inspection calls sketched below are a good place to start diagnosing the setup. After that, only two steps are required.
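These inspection utilities are part of CUDA.jl's standard API (they do not appear in the original transcript); they report which driver, toolkit and device CUDA.jl has picked up:

CUDA.versioninfo()   # driver, toolkit and package versions seen by CUDA.jl
CUDA.device()        # the currently selected GPU device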
First, upload the data to the GPU:
julia> X = X |> gpu
4×10000 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.490089 0.305455 0.334825 0.0466784 … 0.243163 0.732825 0.401764 0.361136
0.102807 0.755049 0.971202 0.522919 0.766326 0.498049 0.669154 0.546359
0.440698 0.777547 0.263636 0.448606 0.854045 0.465477 0.314271 0.854085
0.288417 0.932674 0.808397 0.180423 0.63759 0.00347775 0.565556 0.872233
julia> Y = Y |> gpu
1×10000 adjoint(::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}) with eltype Float32:
0.318696 0.878203 1.84139 0.967079 0.162922 … 1.04725 1.53151 1.32198 0.478175
julia> data = Flux.DataLoader((X, Y), shuffle = true, batchsize = 1000);
Second, upload the function to the GPU:
julia> gFFNN = ftry_FFNN() |> gpu
Chain(
  Dense(4 => 100, relu),                # 500 parameters
  Dense(100 => 100, relu),              # 10_100 parameters
  Dense(100 => 100, relu),              # 10_100 parameters
  Dense(100 => 1),                      # 101 parameters
)                   # Total: 8 arrays, 20_801 parameters, 1.109 KiB.
And just like with the CPU, proceed with the training as usual:
julia> Loss(x,y) = Flux.mse(gFFNN(x), y)
Loss (generic function with 1 method)
julia> ps = Flux.params(gFFNN);
julia> @btime Flux.train!(Loss, ps, data, optimizer)
8.588 ms (24313 allocations: 1.90 MiB)
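Incidentally, uploading X and Y wholesale assumes the whole dataset fits in GPU memory. If it does not, a common alternative is to keep the arrays on the CPU and transfer only the current batch; a minimal sketch, where Xc and Yc stand for CPU copies of the data and Loss, ps, optimizer are as above:

data_cpu = Flux.DataLoader((Xc, Yc), shuffle = true, batchsize = 1000)
for (x, y) in data_cpu
    Flux.train!(Loss, ps, [(gpu(x), gpu(y))], optimizer)   # move one batch per step
end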
In summary, deep learning with GPUs can be boiled down to these three lines:
X |> gpu
Y |> gpu
FFNN |> gpu
For beginners using the GPU, the most important habit is to keep monitoring hardware resource usage, to confirm that the device is actually being used and that training has not become slower than on the CPU. With an incorrect CUDA setup, or without experience in GPU programming, it is easy to end up running on the CPU only while believing that everything is working correctly.
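Beyond watching nvidia-smi or the operating system's task manager, this can also be checked from inside Julia; the calls below rely on CUDA.jl's standard API and the objects defined above:

X isa CuArray                  # true once the data lives on the GPU
gFFNN[1].weight isa CuArray    # true once the model parameters live on the GPU
CUDA.memory_status()           # prints used vs. total GPU memory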
Performance Comparison
julia> @btime Flux.train!(Loss, ps, data, optimizer) # CPU
23.120 ms (3413 allocations: 60.07 MiB)
julia> @btime Flux.train!(Loss, ps, data, optimizer) # GPU
8.427 ms (24313 allocations: 1.90 MiB)
With the GPU, we observe roughly a threefold speedup over the CPU, which shows at the very least that moving to the GPU was not a loss. In practice the gap is often much larger, and it grows with the size of the data and of the architecture; in my own research I have seen differences of more than 50 times for the networks I used.
Conversely, in work where the emphasis is not on performance but on the implementation itself, for instance when the network requires complicated functions or indexing, using the GPU can actually be a disadvantage. In such cases it may be better to stay on the CPU, or to apply various tricks when designing the network so that it runs well on the GPU. How well layers can be arranged for the GPU depends largely on the practitioner's skill.
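A typical example of such a GPU-unfriendly pattern is scalar indexing into a CuArray inside the model or the loss: every element access becomes a separate device operation, so CUDA.jl warns about it or disallows it outright. A minimal illustration (not from the original post):

x = CUDA.rand(Float32, 1_000)
# x[1] + x[2]                      # scalar indexing: slow, and warns or errors
CUDA.@allowscalar x[1] + x[2]      # explicit escape hatch, acceptable only for debugging
sum(abs2, x)                       # vectorized reduction that stays on the GPU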
Full Code
using Flux, BenchmarkTools
optimizer = Adam()
f(v) = v[1].^2 + 2v[2] - v[3]*v[4]
n = 10_000
function ftry_FFNN()
    FFNN = Chain(
        Dense(  4, 100, relu),
        Dense(100, 100, relu),
        Dense(100, 100, relu),
        Dense(100,   1),
    )
    return FFNN
end
X = rand(Float32, 4, n)
Y = f.(eachcol(X))'
data = Flux.DataLoader((X, Y), shuffle = true, batchsize = 1000);
cFFNN = ftry_FFNN()
Loss(x,y) = Flux.mse(cFFNN(x), y)
ps = Flux.params(cFFNN);
@btime Flux.train!(Loss, ps, data, optimizer)   # CPU benchmark
# --------------------
using CUDA
CUDA.functional()
X = X |> gpu
Y = Y |> gpu
data = Flux.DataLoader((X, Y), shuffle = true, batchsize = 1000);
gFFNN = ftry_FFNN() |> gpu
Loss(x,y) = Flux.mse(gFFNN(x), y)
ps = Flux.params(gFFNN);
@btime Flux.train!(Loss, ps, data, optimizer)   # GPU benchmark
# The Performance Comparison section above corresponds to these two @btime calls: the CPU benchmark with cFFNN and the GPU benchmark with gFFNN.
Environment
- OS: Windows
- Julia: v1.9.0
- Flux: v0.13.16