How to Use the GPU in Julia Flux
Overview
In this article, we introduce how to implement deep learning with Flux.jl, a machine learning library for Julia, and how to accelerate training with the GPU.
To use the GPU, CUDA.jl is required and the CUDA settings must be configured properly in advance. The setup for CUDA is similar to that in Python, so refer to a separate post on CUDA setup for guidance.
Deep learning on GPUs is fundamental, yet it is also a significant barrier for beginners, and from that perspective the official Julia tutorials can feel somewhat unfriendly. The process of enabling GPU acceleration in Flux, however, is quite intuitive and straightforward.
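For completeness, a minimal sketch of installing the packages used below; recent versions of CUDA.jl download a compatible CUDA toolkit on first use, so only a working NVIDIA driver needs to be present beforehand:

using Pkg
Pkg.add(["Flux", "CUDA", "BenchmarkTools"])   # CUDA.jl fetches its toolkit artifacts on first use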
Code
julia> using Flux, BenchmarkTools
optimizer = Adam()
f(v) = v[1].^2 + 2v[2] - v[3]*v[4]
n = 10_000
function ftry_FFNN()
    FFNN = Chain(
        Dense(  4, 100, relu),
        Dense(100, 100, relu),
        Dense(100, 100, relu),
        Dense(100,   1),
    )
    return FFNN
end
ftry_FFNN (generic function with 1 method)
This post uses an MLP (a feed-forward neural network) as the example, trained to approximate a non-linear function. The focus here is not on the performance of this particular network but on how to use the GPU to train it. The function ftry_FFNN() is named after the idea that it returns an FFNN, like a factory.
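As a quick sanity check (not part of the original transcript), the factory can be called and applied to a dummy batch; since the last layer is Dense(100, 1), a 4×k input is mapped to a 1×k output:

FFNN = ftry_FFNN()
size(FFNN(rand(Float32, 4, 5)))   # (1, 5): one prediction per column, i.e. per sample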
julia> X = rand(Float32, 4, n)
4×10000 Matrix{Float32}:
0.669836 0.260559 0.710337 0.5121 … 0.298401 0.763405 0.977941 0.89907
0.135182 0.938298 0.110935 0.534417 0.804302 0.197353 0.784419 0.179469
0.287801 0.839834 0.718759 0.15976 0.442106 0.696369 0.41352 0.617924
0.213884 0.345568 0.426718 0.961369 0.542308 0.378965 0.0633196 0.580489
julia> Y = f.(eachcol(X))'
1×10000 adjoint(::Vector{Float32}) with eltype Float32:
0.657488 1.65427 0.419741 1.17749 … 1.45789 0.713595 2.49902 0.808568
Data is generated randomly. Note that Flux treats each column as one observation, so the input X is a 4×n matrix and the target Y is a 1×n row.
CPU
julia> data = Flux.DataLoader((X, Y), shuffle = true, batchsize = 1000);
julia> cFFNN = ftry_FFNN()
Chain(
  Dense(4 => 100, relu),                # 500 parameters
  Dense(100 => 100, relu),              # 10_100 parameters
  Dense(100 => 100, relu),              # 10_100 parameters
  Dense(100 => 1),                      # 101 parameters
)                   # Total: 8 arrays, 20_801 parameters, 81.754 KiB.
julia> Loss(x,y) = Flux.mse(cFFNN(x), y)
Loss (generic function with 1 method)
julia> ps = Flux.params(cFFNN);
julia> @btime Flux.train!(Loss, ps, data, optimizer)
22.027 ms (3413 allocations: 60.07 MiB)
While the training runs, one can observe CPU utilization increase.
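Note that @btime measures a single call to Flux.train!, i.e. one pass over the DataLoader. A full training run wraps this call in an epoch loop; a minimal sketch using the objects defined above (the epoch count is arbitrary):

for epoch in 1:100
    Flux.train!(Loss, ps, data, optimizer)              # one pass over all batches
    epoch % 10 == 0 && println("epoch $epoch: loss = ", Loss(X, Y))
end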
GPU
julia> using CUDA
julia> CUDA.functional()
true
First, load CUDA.jl and verify that CUDA works in the current environment by calling CUDA.functional(). If it returns true, CUDA is operational; if not, the inspection calls sketched below are a good place to start diagnosing the setup. After that, only two steps are required.
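These inspection utilities are part of CUDA.jl's standard API (they do not appear in the original transcript); they report which driver, toolkit and device CUDA.jl has picked up:

CUDA.versioninfo()   # driver, toolkit and package versions seen by CUDA.jl
CUDA.device()        # the currently selected GPU device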
First, upload the data to the GPU:
julia> X = X |> gpu
4×10000 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.490089 0.305455 0.334825 0.0466784 … 0.243163 0.732825 0.401764 0.361136
0.102807 0.755049 0.971202 0.522919 0.766326 0.498049 0.669154 0.546359
0.440698 0.777547 0.263636 0.448606 0.854045 0.465477 0.314271 0.854085
0.288417 0.932674 0.808397 0.180423 0.63759 0.00347775 0.565556 0.872233
julia> Y = Y |> gpu
1×10000 adjoint(::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}) with eltype Float32:
0.318696 0.878203 1.84139 0.967079 0.162922 … 1.04725 1.53151 1.32198 0.478175
julia> data = Flux.DataLoader((X, Y), shuffle = true, batchsize = 1000);
Second, upload the function to the GPU:
julia> gFFNN = ftry_FFNN() |> gpu
Chain(
  Dense(4 => 100, relu),                # 500 parameters
  Dense(100 => 100, relu),              # 10_100 parameters
  Dense(100 => 100, relu),              # 10_100 parameters
  Dense(100 => 1),                      # 101 parameters
)                   # Total: 8 arrays, 20_801 parameters, 1.109 KiB.
And just like with the CPU, proceed with the training as usual:
julia> Loss(x,y) = Flux.mse(gFFNN(x), y)
Loss (generic function with 1 method)
julia> ps = Flux.params(gFFNN);
julia> @btime Flux.train!(Loss, ps, data, optimizer)
8.588 ms (24313 allocations: 1.90 MiB)
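Incidentally, uploading X and Y wholesale assumes the whole dataset fits in GPU memory. If it does not, a common alternative is to keep the arrays on the CPU and transfer only the current batch; a minimal sketch, where Xc and Yc stand for CPU copies of the data and Loss, ps, optimizer are as above:

data_cpu = Flux.DataLoader((Xc, Yc), shuffle = true, batchsize = 1000)
for (x, y) in data_cpu
    Flux.train!(Loss, ps, [(gpu(x), gpu(y))], optimizer)   # move one batch per step
end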
In summary, deep learning with GPUs can be boiled down to these three lines:
X |> gpu
Y |> gpu
FFNN |> gpu
For beginners using the GPU, the most important habit is to keep monitoring hardware resource usage, to confirm that the device is actually being used and that training has not become slower than on the CPU. With an incorrect CUDA setup, or without experience in GPU programming, it is easy to end up running on the CPU only while believing that everything is working correctly.
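Beyond watching nvidia-smi or the operating system's task manager, this can also be checked from inside Julia; the calls below rely on CUDA.jl's standard API and the objects defined above:

X isa CuArray                  # true once the data lives on the GPU
gFFNN[1].weight isa CuArray    # true once the model parameters live on the GPU
CUDA.memory_status()           # prints used vs. total GPU memory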
Performance Comparison
julia> @btime Flux.train!(Loss, ps, data, optimizer) # CPU
23.120 ms (3413 allocations: 60.07 MiB)
julia> @btime Flux.train!(Loss, ps, data, optimizer) # GPU
8.427 ms (24313 allocations: 1.90 MiB)
With the GPU, we observe roughly a threefold speedup over the CPU, which shows at the very least that moving to the GPU was not a loss. In practice the gap is often much larger, and it grows with the size of the data and of the architecture; in my own research I have seen differences of more than 50 times for the networks I used.
Conversely, in work where the emphasis is not on performance but on the implementation itself, for instance when the network requires complicated functions or indexing, using the GPU can actually be a disadvantage. In such cases it may be better to stay on the CPU, or to apply various tricks when designing the network so that it runs well on the GPU. How well layers can be arranged for the GPU depends largely on the practitioner's skill.
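A typical example of such a GPU-unfriendly pattern is scalar indexing into a CuArray inside the model or the loss: every element access becomes a separate device operation, so CUDA.jl warns about it or disallows it outright. A minimal illustration (not from the original post):

x = CUDA.rand(Float32, 1_000)
# x[1] + x[2]                      # scalar indexing: slow, and warns or errors
CUDA.@allowscalar x[1] + x[2]      # explicit escape hatch, acceptable only for debugging
sum(abs2, x)                       # vectorized reduction that stays on the GPU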
Full Code
using Flux, BenchmarkTools
optimizer = Adam()
f(v) = v[1].^2 + 2v[2] - v[3]*v[4]
n = 10_000
function ftry_FFNN()
    FFNN = Chain(
        Dense(  4, 100, relu),
        Dense(100, 100, relu),
        Dense(100, 100, relu),
        Dense(100,   1),
    )
    return FFNN
end
X = rand(Float32, 4, n)
Y = f.(eachcol(X))'
data = Flux.DataLoader((X, Y), shuffle = true, batchsize = 1000);
cFFNN = ftry_FFNN()
Loss(x,y) = Flux.mse(cFFNN(x), y)
ps = Flux.params(cFFNN);
@btime Flux.train!(Loss, ps, data, optimizer)   # CPU benchmark
# --------------------
using CUDA
CUDA.functional()
X = X |> gpu
Y = Y |> gpu
data = Flux.DataLoader((X, Y), shuffle = true, batchsize = 1000);
gFFNN = ftry_FFNN() |> gpu
Loss(x,y) = Flux.mse(gFFNN(x), y)
ps = Flux.params(gFFNN);
@btime Flux.train!(Loss, ps, data, optimizer)   # GPU benchmark
# The Performance Comparison section above corresponds to these two @btime calls: the CPU benchmark with cFFNN and the GPU benchmark with gFFNN.
Environment
- OS: Windows
- Julia: v1.9.0
- Flux: v0.13.16