Implementation of Zomorodian's Algorithm 📂Topological Data Analysis

Implementation of Zomorodian's Algorithm

Overview

This section explains and implements the pseudocode of an algorithm introduced in the paper “Computing Persistent Homology” by Zomorodian and Carlsson¹. It takes a filtered complex constructed from an abstract simplicial complex and returns $\mathcal{P}$ -intervals, omitting the construction of computationally challenging persistent modules and calculating persistent homology through matrix reduction. Furthermore, the actual implementation does not even use matrix operations.

Derivation

Derivation of Zomorodian’s algorithm: It’s certain that without understanding the theoretical aspects of the algorithm, one could not comprehend the algorithm just by looking at the pseudocode. Even if one does not understand it perfectly, one should study enough to grasp why matrices suddenly disappear and why marking is necessary before moving on to implementation.

Algorithm

Let’s assume we have received a filtered complex as data before the algorithm operates.

Let’s create a dictionary or table $T$ like the one above to store data and record information from the algorithm. For example, in a Julia DataFrame, you would copy the numbers in the data to epsilon, the alphabetic part to simplex, and add a boolean column marked to store the marking status, a slot to store the chains of the chain complex, and a column J to store integers that come out during the calculation.

Julia arrays start at $1$ , not $0$ , and there are parts where indices are not considered for convenience of implementation, so there are cases where numbers are all wrong by $1~2$ like the screenshot above, but it’s not important at all, so don’t worry about it.
Note that the slot for storing chains can freely switch between representations as a set and as a chain. For example, the calculation is on $\mathbb{Z}_{2}$ so $a + (a - b) = 2 a - b = b \pmod{2}$ such calculation takes place, which is also the same as the following set operation $\left\{ a \right\} \cup \left\{ a, -b \right\} = \left\{ b \right\}$ In algebraic operations, the element $0$ is treated as ’nonexistent’ in the set and should be accepted as such. It might be unpleasant for those who like precise and meticulous notation, but it’s not entirely unreasonable, so let’s just move on.
Also, the original paper derives the algorithm over a general field $F$ , meaning every $q \in F$ has an inverse $q^{-1} \in F$ , so calculations like $d = d - q^{-1} T[i]$ are performed, but in this implementation, the binary field $\mathbb{Z}_{2}$ is sufficient, so $q^{-1}$ is not calculated separately and is replaced for $\Delta$ defined as $A \Delta B := \left( A \cup B \right) \setminus \left( A \cap B \right)$ like the following. $d = d \Delta T[i]$
The expression $\deg$ appears too frequently as a (program) function, an index, and the degree of a polynomial function, causing confusion. Therefore, in $T$ , it’s denoted as epsilon instead of deg. In fact, in simple level topological data analysis, the value of that column, the radius $\varepsilon > 0$ , usually increases to construct a filtered complex.
$T$ assumes a perfect alignment according to the dimensions of the simplices, and accordingly, epsilon is expected to have a partial order.

Pseudocode

$\left\{ L_{k} \right\}$ = COMPUTEINTERVALS $(K)$

Input: Receives a filtered complex $K$ . The filtered complex must have at least two pieces of information: at what timing $\deg \in \mathbb{Z}$ a certain simplex $\sigma$ was added.
Output: Obtains a set $\left\{ L_{k} \right\}_{k=0}^{\dim K}$ of sets $L_{k}$ of $\mathcal{P}$ -intervals for $k = 0, \cdots , \dim K$ .
Side Effect: Modifies marked in table $T$ where data is recorded.

$d$ = REMOVEPIVOTROWS $(\sigma)$

Input: Receives a $k$ -dimensional simplex $\sigma$ .
Output: Obtains an element of some $\mathsf{C}_{k-1}$ , which is a $(k-1)$ -dimensional chain, meaning it’s operated on by $k$ -dimensional simplices.

$i$ = maxindex $d$

Input: Receives the chain $d$ .
Output: Returns the largest index $i$ among all simplex included in chain $d$ in table $T$ . For example, for maxindex(abc), it should return the largest $9$ among $5$ of ab, $6$ of bc, and $9$ of ac.

$k$ = dim $d$

Input: Receives a $k$ -dimensional chain $d$ .
Output: Returns the integer $k$ .

$k$ = dim $\sigma$

Input: Receives a $k$ -dimensional simplex $\sigma$ .
Output: Returns the integer $k$ .

$k$ = deg $(\sigma)$

Input: Receives the simplex $\sigma$ .
Output: Returns the integer epsilon corresponding to simplex $\sigma$ in table $T$ . For example, if deg(cd), it should return $2$ since epsilon of cd is $2$ .

Keywords

Mark is used in the form Mark $\sigma$ to change the marked of the corresponding simplex $\sigma$ to true.
Store is used in the form Store $j$ and $d$ in $T[i]$ to store the integer $j$ in J of $T[i]$ and the chain $d$ in slot.
Remove is used in the form Remove $x$ in $d$ to remove the term $x$ in the chain $d$ .

$\sigma^{i}$ is the simplex at the $i$ th position in table $T$ , and $m$ is the length of $T$ .

function COMPUTEINTERVALS $(K)$
　　# Initialization
　　for $k \in 0:\dim K$
　　　　 $L_{k} := \emptyset$
　　end for

　　for $j \in 0:(m-1)$
　　　　 $d$ = REMOVEPIVOTROWS $\left( \sigma^{j} \right)$
　　　　if $d = \emptyset$
　　　　　　# $d$ being empty means it’s a candidate for a (non-pivot) zero column
　　　　　　mark $\sigma^{j}$
　　　　else
　　　　　　# Must calculate the degree of $d$ , so it must be the max of all terms
　　　　　　# $d$ is a chain one dimension lower than $\sigma^{j}$ and must be $i < j$
　　　　　　 $i$ = maxindex $d$
　　　　　　 $k$ = dim $d$
　　　　　　 $L_{k}$ = $L_{k} \cup \left\{ \left( \deg \sigma^{i}, \deg \sigma^{j} \right) \right\}$
　　　　end if
　　end for

　　for $j \in 0:(m-1)$
　　　　# If it’s still not marked, it’s definitely a zero column
　　　　if $\sigma^{j}$ ismarked and $T[j]$ isempty
　　　　　　 $k$ = dim $d$
　　　　　　# Corresponds to $H_{k-1}$ to $\sum^{\hat{e}_{i}} F[t]$ , handling infinity
　　　　　　 $L_{k}$ = $L_{k} \cup \left\{ \left( \deg \sigma^{i}, \infty \right) \right\}$
　　　　end if
　　end for
　　return $\left\{ L_{k} \right\}$
end function

function REMOVEPIVOTROWS $(\sigma)$
　　 $k$ = $\dim \sigma$
　　# Assuming $\partial abc = ab - bc + ca$ , ∂("abc") = ["ab", "bc", "ca"]
　　 $d$ = $\partial_{k} \sigma$
　　Remove not marked $(k-1)$ -dimensional simplex in $d$
　　while $d \ne \emptyset$
　　　　 $i$ = maxindex $d$
　　　　if $T[i]$ isempty
　　　　　　break
　　　　end if
　　　　# Since it’s $\mathbb{Z}_{2}$ , replace with symdiff (symmetric difference)
　　　　 $d$ = $d \Delta T[i]$
　　end while
　　return $d$
end function

Implementation

The result of the algorithm for the given example should be as follows: $\begin{align*} L_{0} =& \left\{ \left( 0, \infty \right) , \left( 0,1 \right) , \left( 1,1 \right) , \left( 1,2 \right) \right\} \\ L_{1} =& \left\{ \left( 2,5 \right) , \left( 3,4 \right) \right\} \end{align*}$

The result of the implementation in Julia is as follows. Except for the parts where the indices are precisely incorrect, it can be seen that it has been implemented correctly.

Full Code

As you can see, the code is written almost exactly following the notation of the original paper. For instance, the Julia-like code for $L_{k} = L_{k} \cup \left\{ \left( \deg \sigma^{i}, \deg \sigma^{j} \right) \right\}$ is push!(L_[k], (deg(σⁱ), deg(σʲ))), but to make it look almost identical to the paper, it’s implemented as L_[k] = L_[k] ∪ [(deg(σⁱ), deg(σʲ))].

using DataFrames

data = DataFrame([
0 "a"
0 "b"
1 "c"
1 "d"
1 "ab"
1 "bc"
2 "cd"
2 "ad"
3 "ac"
4 "abc"
5 "acd"
], ["epsilon", "simplex"])
T = copy(data)
T[!, :"marked"] .= false
T[!, :"slot"] .= [[]]
T[!, :"J"] .= 0

dimK = 2
m = nrow(T)

test_σ = "abc"
dim(σ) = length(σ)

function deg(σ)
    return T.epsilon[findfirst(T.simplex .== σ)]
end
deg(test_σ)

function ∂(σ)
    k = dim(σ)
    return [σ[(1:k)[Not(t)]] for t = 1:k]
end
∂(test_σ)

function maxindex(chain)
    return (T.simplex .∈ Ref(chain)) |> findall |> maximum
end
maxindex(∂(test_σ))

function REMOVEPIVOTROWS(σ)
    k = dim(σ); d = ∂(σ)
    d = d[d .∈ Ref(T[T.marked,:simplex])] # Remove unmarked terms in ▷eq029◁
    while !(d |> isempty)
        i = maxindex(d)
        if T[i,:slot] |> isempty break end
        d = symdiff(d, T[i,:slot])
        # print("d in empty")
    end
    return d
end
REMOVEPIVOTROWS(test_σ)

L_ = [[] for k = 0:dimK]
for j0 = 0:(m-1)
    j = j0+1
    σʲ = T[j,:simplex]
    d = REMOVEPIVOTROWS(σʲ)
    if d |> isempty
        T[j,:marked] = true
    else
        i = maxindex(d); k = dim(σʲ)
        σⁱ = T[i,:simplex]
        T[i,[:J,:slot]] = j0,d
        L_[k] = L_[k] ∪ [(deg(σⁱ), deg(σʲ))]
    end
end
for j0 = 0:(m-1)
    j = j0+1
    σʲ = T[j,:simplex]
    if (T[j,:marked]) && (T[j,:slot] |> isempty) && (T[j,:J] |> iszero)
        k = dim(σʲ); L_[k] = L_[k] ∪ [(deg(σʲ), Inf)]
        print("j: $j")
    end
end

Zomorodian. (2005). Computing Persistent Homology: ch4 ↩︎