logo

Optimizing Distance Matrix Calculations in Julia 📂Julia

Optimizing Distance Matrix Calculations in Julia

Conclusion

Let’s calculate the distance between $n$ coordinates.

  1. If it’s not necessary to calculate the distance between all coordinates, divide them into groups and create a rectangular distance matrix.
  2. The rectangular distance matrix can be calculated quickly and easily with the pairwise() function.

Speed Comparison

Let’s imagine doing a moving agent-based simulation for the SIR model. The original time complexity is $O \left( n^{2} \right)$, but if you divide it into $S$ and $I$ groups for calculation, the time complexity drops to $O \left( n_{S} n_{I} \right)$. Typically, the spread of diseases is implemented by calculating the distance matrix between $S$ and $I$ to judge whether they fall within a certain radius $\varepsilon$, and by how much they have made contact. Let’s compare the speeds with this in mind.

using Distances
using StatsBase
using Random

Random.seed!(0);
N = 10^4
location = rand(2, N);
state = sample(['S', 'I'], Weights([0.1, 0.9]), N);
S = location[:, state .== 'S']
I = location[:, state .== 'I']

function foo(S, I)
    contact = Int64[]
    for s in 1:996
        push!(contact, sum(sum((I .- S[:,s]).^2, dims = 1) .< 0.1))
    end
    return contact
end

@time foo(S, I)

Of course, the idea of dividing into groups to calculate improves performance significantly, not just in Julia but with any method used. The point is that there’s no need to loop through brute force when you can make good use of the Distance package’s pairwise() function.

julia> @time foo(S, I);
  0.170835 seconds (7.98 k allocations: 210.854 MiB, 12.56% gc Time)

julia> @time sum(pairwise(Euclidean(),S,I) .< 0.1, dims = 1);
  0.087875 seconds (14 allocations: 69.639 MiB, 4.15% gc Time)

These two commands do exactly the same thing, but in terms of speed, one is about twice as fast as the other, and in terms of allocation, the difference is so substantial that one might not even want to calculate it, not to mention that coding difficulty is much easier compared to looping.

Further Research 1

When the Euclidean distance is the distance function, using SqEuclidean() instead of Euclidean() omits the root taking calculation, thus speeding up the process. The following yields exactly the same results, but the speed difference is about 1.5 times.

julia> @time sum(pairwise(Euclidean(),S,I) .< 0.1, dims = 1);
  0.091917 seconds (14 allocations: 69.639 MiB, 7.60% gc Time)

julia> @time sum(pairwise(SqEuclidean(),S,I) .< 0.01, dims = 1);
  0.061776 seconds (14 allocations: 69.639 MiB, 11.37% gc Time)

Moreover, it can get even faster. At this point, simple code optimization will not be sufficient to increase speed, and a k-d tree2, a data structure favorable for multi-dimensional search, must be utilized. See how to quickly calculate distances with NearstNeighbors.jl.

Environment

  • OS: Windows
  • julia: v1.5.0