Paper Review: Kolmogorov-Arnold Neural Network (KAN)
Overview and Summary
Kolmogorov–Arnold Networks (KAN) are neural networks inspired by the Kolmogorov–Arnold representation theorem. The idea has been discussed for decades, and authors like Girosi and Poggio argued in the 1989 paper ‘Representation Properties of Networks: Kolmogorov’s Theorem Is Irrelevant’ [1] that there is little direct connection between the theorem and neural networks. Nonetheless, naming these networks after mathematicians emphasizes their strong theoretical foundation. The paper, submitted to arXiv [2] in April 2024, had already been cited nearly 200 times by September 27, 2024, which indicates some degree of success.
The concept behind KAN is compelling: to rival the modern, dominant MLP in AI by breaking through the black-box nature of existing MLPs and deep learning. Clear understanding and transparency are crucial, especially for applications in finance, healthcare, space, and the military, where research has already produced quite novel approaches.
However, it’s necessary to be cautious about claims related to model performance and learning speed, as some parts of the mathematical formalization seem somewhat haphazard.
0. Abstract
KAN draws inspiration from the Kolmogorov-Arnold representation theorem and differs from traditional MLPs, which apply fixed, predefined activation functions at nodes and learn weights on the links. KAN instead learns the activation functions themselves, placing them on the links. The core theme of KAN is interpretability: in several cases proposed by the authors, KAN could collaborate with mathematicians and physicists and prove to be a promising alternative to MLPs.
1. Introduction
The major criticism of MLPs is their lack of interpretability. While MLPs rest theoretically on the Universal Approximation Theorem, KAN is based on the Kolmogorov-Arnold representation theorem. MLPs mimic the brain by applying nonlinear functions to linear combinations of inputs, and although the computations can be heavy, GPUs process them as matrix operations, which has driven significant advances. KAN’s basic building block, by contrast, is a one-dimensional function, learned concretely as a spline. This naturally raises concerns like ‘Won’t the computations take too long?’ and ‘Can’t we use GPUs?’, but KAN typically requires a much smaller network than an MLP for comparable performance.
Previous efforts to build neural networks on the Kolmogorov-Arnold representation theorem struggled because modern training techniques cannot easily respect the theorem’s rigidly specified structure. The authors introduce structural generalizations to address this, though they admit that KAN essentially combines splines and MLP concepts.
Despite repeated claims from the authors that KAN overcomes the curse of dimensionality, this point remains contentious. KAN showcases various problems and appeals to mathematicians and scientists, but it’s still hard to see it dethrone the engineering and industrial dominance of existing neural network models, suggesting its primary focus remains basic science. The code for this research is available on the GitHub repository https://github.com/KindXiaoming/pykan.
2. Kolmogorov–Arnold Networks (KAN)
This section reemphasizes the correspondence between MLPs and the Universal Approximation Theorem on the one hand, and KANs and the Kolmogorov-Arnold representation theorem on the other.
2.1 Kolmogorov-Arnold Representation Theorem
The Kolmogorov-Arnold representation theorem states that a continuous multivariable function on a bounded domain can be expressed using only addition and composition of finitely many continuous single-variable functions. Specifically, a smooth function $f : [0, 1]^{n} \to \mathbb{R}$ can be represented as $$ f \left( x_{1} , \cdots , x_{n} \right) = \sum_{q=1}^{2n+1} \Phi_{q} \left( \sum_{p=1}^{n} \phi_{q,p} \left( x_{p} \right) \right) $$ for some $\phi_{q,p} : [0, 1] \to \mathbb{R}$ and $\Phi_{q} : \mathbb{R} \to \mathbb{R}$. The issue with this theorem is that it only asserts existence without constraining the functions $\phi_{q,p}$ and $\Phi_{q}$, which may be highly non-smooth, making it of little practical help for discovering such functions in machine learning, a point the authors acknowledge has been viewed pessimistically.
Nevertheless, the authors view this theorem optimistically, willing to adapt it without obsessing over exact structures. They note that in both scientific and ordinary contexts, smooth and sparse structures are often encountered and can work well.
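For intuition, here is a small hand-made decomposition of $f(x, y) = xy$ in this nested form (my own illustration, not the canonical $2n+1$-term construction): take $\phi_{1,1}(x) = \phi_{2,1}(x) = x$, $\phi_{1,2}(y) = y$, $\phi_{2,2}(y) = -y$, $\Phi_{1}(u) = u^{2}/4$, and $\Phi_{2}(u) = -u^{2}/4$, so that $$ \Phi_{1} \left( \phi_{1,1}(x) + \phi_{1,2}(y) \right) + \Phi_{2} \left( \phi_{2,1}(x) + \phi_{2,2}(y) \right) = {\frac{ (x+y)^{2} - (x-y)^{2} }{ 4 }} = xy $$ Only two outer terms are needed here; the theorem guarantees that $2n+1$ always suffice, but says nothing about how tame the resulting functions are.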
2.2 KAN Architecture
Here, several notations are introduced. The structure of KAN is represented by an array of integers: $$ \left[ n_{0} , n_{1} , \cdots , n_{L} \right] $$ $n_{l}$ denotes the number of nodes in the $l$-th layer, and this KAN includes $\left( L+1 \right)$ layers including input and output layers. The $i$-th node in the $l$-th layer is denoted as $\left( l, i \right)$, and its value is represented as $x_{l,i}$. There are exactly $n_{l} n_{l+1}$ links (activation functions) between the $l$-th and $(l+1)$-th layers, and the activation function for a link connecting neurons $(l,i)$ and $\left( l+1, j \right)$ is denoted as $\phi_{l,j,i}$. The index ranges are naturally defined by $l = 0, \cdots, L-1$, front layer $i = 1 , \cdots , n_{l}$, and back layer $j = 1 , \cdots , n_{l+1}$.
The value of a back layer node $x_{l+1, j}$ is given by $$ x_{l+1, j} = \sum_{i=1}^{n_{l}} \phi_{l,j,i} \left( x_{l, i} \right) $$ and these values are collected into the bold vector notation $\mathbf{x}_{l} = \left( x_{l, 1} , \cdots , x_{l, n_{l}} \right)$. To better understand this process in matrix form: $$ \begin{align*} & \mathbf{x}_{l+1} \\ =& \Phi_{l} \left( \mathbf{x}_{l} \right) \\ =& \begin{bmatrix} \phi_{l,1,1} (\cdot) & \phi_{l,1,2} (\cdot) & \cdots & \phi_{l,1,n_{l}} (\cdot) \\ \phi_{l,2,1} (\cdot) & \phi_{l,2,2} (\cdot) & \cdots & \phi_{l,2,n_{l}} (\cdot) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{l,n_{l+1},1} (\cdot) & \phi_{l,n_{l+1},2} (\cdot) & \cdots & \phi_{l,n_{l+1},n_{l}} (\cdot) \end{bmatrix} \mathbf{x}_{l} \\ =& \begin{bmatrix} \phi_{l,1,1} \left( x_{l, 1} \right) + \phi_{l,1,2} \left( x_{l, 2} \right) + \cdots + \phi_{l,1,n_{l}} \left( x_{l, n_{l}} \right) \\ \phi_{l,2,1} \left( x_{l, 1} \right) + \phi_{l,2,2} \left( x_{l, 2} \right) + \cdots + \phi_{l,2,n_{l}} \left( x_{l, n_{l}} \right) \\ \vdots \\ \phi_{l,n_{l+1},1} \left( x_{l, 1} \right) + \phi_{l,n_{l+1},2} \left( x_{l, 2} \right) + \cdots + \phi_{l,n_{l+1},n_{l}} \left( x_{l, n_{l}} \right) \end{bmatrix} \\ =& \begin{bmatrix} \sum_{i=1}^{n_{l}} \phi_{l,1,i} \left( x_{l, i} \right) \\ \sum_{i=1}^{n_{l}} \phi_{l,2,i} \left( x_{l, i} \right) \\ \vdots \\ \sum_{i=1}^{n_{l}} \phi_{l,n_{l+1},i} \left( x_{l, i} \right) \end{bmatrix} \end{align*} $$ This notation is not mathematically rigorous and should be read with caution: it only mimics a matrix product, because each entry $\phi$ is applied to the corresponding component of $\mathbf{x}_{l}$ and the results are summed, rather than an actual matrix multiplication of $\Phi_{l}$ and $\mathbf{x}_{l}$ being performed.
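To make the indexing concrete, here is a minimal NumPy sketch of a single KAN layer forward pass (my own illustration, not the authors’ implementation); `phis[j][i]` plays the role of $\phi_{l,j,i}$ and the toy edge functions stand in for learned splines:

```python
import numpy as np

def kan_layer_forward(x, phis):
    """One KAN layer: x_{l+1, j} = sum_i phi_{l,j,i}(x_{l, i}).

    x    : array of shape (n_l,), node values of layer l
    phis : list of n_{l+1} lists, each with n_l callables; phis[j][i] ~ phi_{l,j,i}
    """
    return np.array([sum(phi(xi) for phi, xi in zip(row, x)) for row in phis])

# Toy [2, 3] layer: arbitrary 1-D functions standing in for learned splines.
phis = [[np.sin, np.tanh],
        [np.cos, abs],
        [np.exp, np.square]]
print(kan_layer_forward(np.array([0.5, -1.0]), phis))   # shape (3,)
```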
This allows KAN to be expressed as a composite function of $\Phi_{l}$ such that: $$ \operatorname{KAN} \left( \mathbf{x} \right) = \left( \Phi_{L-1} \circ \cdots \circ \Phi_{1} \circ \Phi_{0} \right) \left( \mathbf{x} \right) $$ Comparable to MLPs represented as: $$ \operatorname{MLP} \left( \mathbf{x} \right) = \left( W_{L-1} \circ \sigma \circ \cdots \circ \sigma \circ W_{1} \circ \sigma \circ W_{0} \right) \left( \mathbf{x} \right) $$
Implementation Details
$$ \phi (x) = w_{b} b(x) + w_{s} \operatorname{spline} (x) $$ The activation function $\phi$ takes this form in principle. The factors $w_{b}$ and $w_{s}$ are technically redundant, since they could be absorbed into $b(x)$ and $\operatorname{spline}(x)$, but they are retained to help control the overall magnitude of the activation. The base function $b(x)$ and the spline $\operatorname{spline}(x)$ can in principle be chosen freely; in this study: $$ \begin{align*} b(x) =& \operatorname{SiLU} (x) := x \sigma (x) = {\frac{ x }{ 1 + e^{-x} }} \\ \operatorname{spline} (x) =& \sum_{i} c_{i} B_{i} (x) \end{align*} $$ Here $\sigma$ is the logistic function and $B_{i}$ are B-spline basis functions, with the coefficients $c_{i}$ being the trainable parameters. Since B-splines evaluate to $0$ outside their grid, the spline grids need to be updated during training to track the range of incoming values. Whether this is efficient is questionable, but it does substantiate the authors’ claim that “activation functions learn.”
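A minimal sketch of this parameterization, using SciPy’s `BSpline` for the basis (my own code, not the paper’s implementation; the grid, degree, and coefficient values are illustrative):

```python
import numpy as np
from scipy.interpolate import BSpline

def silu(x):
    return x / (1.0 + np.exp(-x))               # SiLU(x) = x * sigmoid(x)

def make_phi(grid, coeffs, w_b=1.0, w_s=1.0, k=3):
    """phi(x) = w_b * SiLU(x) + w_s * sum_i c_i B_i(x), with a clamped degree-k B-spline."""
    t = np.r_[[grid[0]] * k, grid, [grid[-1]] * k]      # clamped knot vector over the grid
    spline = BSpline(t, coeffs, k, extrapolate=False)   # B-splines vanish outside the grid
    def phi(x):
        s = np.nan_to_num(spline(x), nan=0.0)           # out-of-grid points contribute 0
        return w_b * silu(x) + w_s * s
    return phi

k = 3
grid = np.linspace(-1.0, 1.0, 6)                        # G = 5 grid intervals
coeffs = np.random.randn(len(grid) + k - 1)             # len(t) - k - 1 trainable coefficients
phi = make_phi(grid, coeffs, k=k)
print(phi(np.array([-0.5, 0.0, 2.0])))                  # 2.0 lies outside the grid: only SiLU contributes
```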
2.3 KAN’s Approximation Abilities and Scaling Laws
Now, the fundamental theorem of KAN is introduced.
Approximation Theorem (KAT): Given $\mathbf{x} = \left( x_{1} , \cdots , x_{n} \right)$, suppose each $\phi_{l,j,i}$ is $(k+1)$-times continuously differentiable and that $f$ admits the representation $$ \begin{align*} \Phi_{l} := & \left( \sum_{i=1}^{n_{l}} \phi_{l,1,i} , \cdots , \sum_{i=1}^{n_{l}} \phi_{l,n_{l+1},i} \right) \\ f =& \left( \Phi_{L-1} \circ \cdots \circ \Phi_{1} \circ \Phi_{0} \right) \end{align*} $$ Then, approximating each $\phi_{l,j,i}$ by a B-spline of degree $k$ on a grid of size $G$, written $\phi_{l,j,i}^{G}$, there exists a constant $C$ depending on $f$ such that for all $0 \le m \le k$: $$ \left\| f - \left( \Phi_{L-1}^{G} \circ \cdots \circ \Phi_{1}^{G} \circ \Phi_{0}^{G} \right) \right\| \le C G^{-k-1+m} $$ Here, the norm $\left\| \cdot \right\|$ is the $C^{m}$-norm, defined via the differential operator $D$ to measure the magnitude of derivatives up to order $m$: $$ \left\| g \right\| := \max_{| \beta | \le m} \sup_{x \in [0, 1]^{n}} \left| D^{\beta} g(x) \right| $$
Though the theorem looks as if it should follow directly from standard B-spline approximation properties, a fair amount of background knowledge is needed to prove it. Despite deviating from the Kolmogorov-Arnold theorem, the introduction of B-splines makes the intention transparent. The error bound $G^{-k-1+m}$ means the approximation accuracy improves as more is invested in the B-splines, since $$ \lim_{G , k \to \infty} G^{-k-1+m} = \lim_{G , k \to \infty} {\frac{ G^{m-1} }{ G^{k} }} = 0 $$ Concretely, increasing the number of B-spline grid intervals $G$ and the polynomial degree $k$ reduces the upper bound on the error. This theorem, rather than the representation theorem itself, is the real theoretical foundation of the new network; in that sense B-splines are the true breakthrough, and the Kolmogorov-Arnold theorem remains a motivational reference.
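As a concrete reading of the bound (my own arithmetic): with cubic splines ($k = 3$) and the plain sup norm ($m = 0$), the bound scales as $C G^{-4}$, so doubling the grid density shrinks it by a factor of $16$: $$ {\frac{ C (2G)^{-4} }{ C G^{-4} }} = 2^{-4} = {\frac{ 1 }{ 16 }} $$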
The authors emphasize that the bound does not depend on the input dimension and present this as overcoming the curse of dimensionality, yet in practice the constant $C$, which depends on $f$ and its representation, may render such bounds uninformative. Future benchmark validations on high-dimensional data such as images would do more to substantiate KAN’s claim of surmounting the curse of dimensionality.
2.4 For Accuracy: Grid Extension
The idea is intuitive: a function approximated by B-splines can be made more accurate by extending the grid, that is, by subdividing the spline domain into finer intervals or adding points outside the current domain. This is appealing compared to MLPs, which are sensitive to changes in individual weights, but it is unclear which practical applications require such fine-grained control. Even the paper’s own observation that “small KANs generalize better” suggests uncertainty about how effective grid extension really is, so it may not warrant deep consideration.
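A sketch of what grid extension could look like in practice, refitting an already-trained coarse spline on a finer grid by least squares (my own SciPy-based illustration; the paper’s implementation differs in detail):

```python
import numpy as np
from scipy.interpolate import BSpline, make_lsq_spline

def extend_grid(coarse: BSpline, new_intervals: int, k: int = 3) -> BSpline:
    """Refit an existing degree-k spline on a finer grid via least squares."""
    lo, hi = coarse.t[k], coarse.t[-k - 1]                   # current spline domain
    xs = np.linspace(lo, hi, 20 * new_intervals)             # dense samples of the current activation
    ys = coarse(xs)
    fine_grid = np.linspace(lo, hi, new_intervals + 1)
    t_fine = np.r_[[lo] * k, fine_grid, [hi] * k]            # clamped knot vector on the finer grid
    return make_lsq_spline(xs, ys, t_fine, k=k)              # new, larger coefficient vector

k = 3
grid = np.linspace(-1.0, 1.0, 6)                             # G = 5 intervals
t = np.r_[[grid[0]] * k, grid, [grid[-1]] * k]
c = np.random.randn(len(t) - k - 1)                          # stand-in for trained coefficients
finer = extend_grid(BSpline(t, c, k), new_intervals=20)      # G = 5 -> G = 20
```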
2.5 For Interpretability: Simplifying KANs and Making them Interactive
Next, an idea to enhance interpretability by sparsifying KAN’s structure is introduced.
2.5.1 Simplification Techniques
Unlike MLPs, KAN allows the values flowing between layers to be traced completely. If an activation function collapses to the zero function, mapping every input to $0$, the corresponding link can be removed, and the surviving activations can then be replaced with human-readable expressions. Two terms are added to the loss function to encourage this.
Firstly, adopting the idea of lasso regression, an $L_{1}$ norm $\left| \cdot \right|_{1}$ is used, a concept familiar from data science. The $L_{1}$ norm $\left| \phi \right|_{1}$ of an activation function $\phi$ is defined as the average of its absolute outputs $\left| \phi \left( x^{(s)} \right) \right|$ over the $N$ training inputs $x^{(s)}$: $$ \left| \phi \right|_{1} := {\frac{ 1 }{ N }} \sum_{s=1}^{N} \left| \phi \left( x^{(s)} \right) \right| $$ Accordingly, the $L_{1}$ norm $\left| \Phi \right|_{1}$ of a layer $\Phi$ is defined naturally as: $$ \left| \Phi \right|_{1} := \sum_{i=1}^{n_{\text{in}}} \sum_{j=1}^{n_{\text{out}}} \left| \phi_{j,i} \right|_{1} $$ However, the authors find this alone inadequate, so they also define the entropy $S \left( \Phi \right)$ of a layer $\Phi$: $$ S \left( \Phi \right) := - \sum_{i=1}^{n_{\text{in}}} \sum_{j=1}^{n_{\text{out}}} {\frac{ \left| \phi_{j,i} \right|_{1} }{ \left| \Phi \right|_{1} }} \log {\frac{ \left| \phi_{j,i} \right|_{1} }{ \left| \Phi \right|_{1} }} $$ Lastly, the total loss $\mathscr{L}$ combines the original prediction loss $\mathscr{l}$ with these two terms, weighted by $\mu_{1}$ and $\mu_{2}$: $$ \mathscr{L} = \mathscr{l} + \mu_{1} \sum_{l=0}^{L-1} \left| \Phi_{l} \right|_{1} + \mu_{2} \sum_{l=0}^{L-1} S \left( \Phi_{l} \right) $$ Since the entropy is computed over the relative magnitudes $\left| \phi_{j,i} \right|_{1} / \left| \Phi \right|_{1}$, minimizing it pushes this distribution toward concentration: a few activations stay large while the rest are driven toward $0$.
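A minimal sketch of these regularization terms, assuming the activation outputs on the training batch have already been collected (my own code; `layer_outputs[j][i]` is the array of $N$ outputs of $\phi_{j,i}$):

```python
import numpy as np

def phi_l1(outputs):
    """|phi|_1: mean absolute activation output over the N training samples."""
    return np.mean(np.abs(outputs))

def layer_l1_and_entropy(layer_outputs, eps=1e-12):
    """|Phi|_1 and S(Phi) for one layer; layer_outputs[j][i] holds the outputs of phi_{j,i}."""
    norms = np.array([[phi_l1(out) for out in row] for row in layer_outputs])
    l1 = norms.sum()
    p = norms / (l1 + eps)                       # relative magnitude of each activation
    entropy = -np.sum(p * np.log(p + eps))
    return l1, entropy

def total_loss(pred_loss, all_layer_outputs, mu1=1.0, mu2=1.0):
    """L = l + mu1 * sum_l |Phi_l|_1 + mu2 * sum_l S(Phi_l)."""
    terms = [layer_l1_and_entropy(lo) for lo in all_layer_outputs]
    return pred_loss + mu1 * sum(t[0] for t in terms) + mu2 * sum(t[1] for t in terms)
```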
However, the manual symbolification process built on top of this model is described somewhat loosely. As explained in Figure 2.4, an initial visualization draws each activation as a small curve and sets its opacity to $\tanh \left( 3 \left| \phi \right|_{1} \right)$, so that significant links stand out. Subsequent steps define an incoming score $I_{l,i}$ and an outgoing score $O_{l,i}$ for each node: $$ \begin{align*} I_{l,i} :=& \max_{k} \left| \phi_{l-1, i, k} \right|_{1} \\ O_{l,i} :=& \max_{k} \left| \phi_{l+1, k, i} \right|_{1} \end{align*} $$ Nodes whose score falls below a threshold $\theta = 10^{-2}$ are pruned. Finally, candidate symbolic functions $f$ are fitted against samples of each remaining $\phi$, obtained by plugging in several values of $x$, and the spline is replaced by the best-matching $f$.
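A sketch of the pruning criterion under the review’s notation, assuming the per-edge $L_{1}$ norms have already been computed (my own illustration):

```python
import numpy as np

def surviving_nodes(incoming_norms, outgoing_norms, theta=1e-2):
    """Keep node (l, i) only if both its incoming and outgoing scores exceed theta.

    incoming_norms[i, k] ~ |phi_{l-1, i, k}|_1 for links entering node (l, i)
    outgoing_norms[k, i] ~ |phi_{., k, i}|_1  for links leaving node (l, i)
    """
    I = incoming_norms.max(axis=1)              # I_{l,i}
    O = outgoing_norms.max(axis=0)              # O_{l,i}
    return np.flatnonzero((I > theta) & (O > theta))
```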
🤔 … to be fair, extracting such symbolic expressions is not the core of KAN. Still, the equation-extraction procedure may feel underwhelming compared to the ingenious development that precedes it. The authors close Section 2 by comparing symbolic regression with KAN, arguing that while symbolic regression is finicky to get right, KAN’s equation extraction amounts to inspecting and editing the network’s own structure.
The following sections compare KAN and MLP and experiment with various problems to demonstrate KAN’s potential, spanning roughly 30 pages. Highlighting only the key points is sufficient for this review.
3. KANs are Accurate
3.3 Feynman Datasets
The Feynman dataset, collected from Feynman’s textbooks, consists of numerous physics equations; the authors use it to demonstrate KAN’s efficacy across this diverse set of functional forms.
3.4 Solving Partial Differential Equations
Physics-informed neural networks (PINNs) are primarily used to solve partial differential equations; using the Poisson equation as an example, the authors show that KAN and the PINN approach can be combined. PINN’s core idea, incorporating the governing equation into the loss function, is independent of the backbone network and therefore fits KAN’s structure naturally.
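A generic sketch of that core idea, independent of whether the backbone is a KAN or an MLP (my own PyTorch illustration; `model` maps $(x, y)$ points to $u$, and `source` is a placeholder for the Poisson source term $f$):

```python
import torch

def poisson_residual(model, xy, source):
    """Residual of the 2-D Poisson equation u_xx + u_yy = f at collocation points xy of shape (N, 2)."""
    xy = xy.clone().requires_grad_(True)
    u = model(xy).sum()                                                     # scalar for autograd
    grad = torch.autograd.grad(u, xy, create_graph=True)[0]                 # columns: u_x, u_y
    u_xx = torch.autograd.grad(grad[:, 0].sum(), xy, create_graph=True)[0][:, 0]
    u_yy = torch.autograd.grad(grad[:, 1].sum(), xy, create_graph=True)[0][:, 1]
    return u_xx + u_yy - source(xy)

def pinn_loss(model, xy_interior, xy_boundary, u_boundary, source, alpha=1.0):
    """Interior PDE residual plus boundary-condition mismatch, both driven to zero during training."""
    pde = poisson_residual(model, xy_interior, source).pow(2).mean()
    bc = (model(xy_boundary).squeeze(-1) - u_boundary).pow(2).mean()
    return pde + alpha * bc
```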
4. KANs are Interpretable
4.3 Application to Mathematics: Knot Theory
Knot theory is presented as an example of how KAN could assist pure mathematics research. Knot theory focuses on distinguishing knots by their invariants, which raises questions such as ‘can we find functions defining these invariants?’ Here, the authors emphasize KAN’s potential contribution to “AI for Math.”
4.4 Application to Physics: Anderson Localization
I am not familiar enough with Anderson localization to explain it in detail, but the gist is the same: KAN could aid physics research.
5. Related Works
The discussion of work related to KAN covers current trends in the field as of 2024 and the line of ideas leading up to KAN.
6. Discussion
Final Takeaway: Should I Use KANs or MLPs?
Which should you use, KAN or MLP? The authors recommend trying KAN in scenarios where interpretability is vital, and training time is not a significant concern, especially for moderately scaled AI and natural science problems.
Epilogue
KAN, as introduced, updates the B-spline domains at each training iteration, a fundamentally more involved procedure than what existing MLPs require. While popular MLP-based networks leverage the GPU’s powerful linear algebra routines, KAN is inherently ill-suited to that kind of brute-force computation. The authors mention that KAN trains roughly ten times slower than an MLP with the same number of parameters, and even that estimate may be optimistic. Deep learning excels at heuristic network structures that map well onto such hardware, which is what has enabled its exponential performance improvements.
Nonetheless, KAN’s theoretical foundation is clear, positioning it as a strong candidate to address the opaque nature of black-box techniques in deep learning. Naturally, attempts to resolve KAN’s speed issues have emerged:
- efficient KAN: modifies the calculation of $\left| \phi \right|_{1}$, replacing the average of absolute activation outputs with the average of the absolute parameter values (the B-spline coefficients). The mathematical intent is similar, and the regularization more directly inherits the lasso idea.
- FastKAN: replaces B-splines with Gaussian radial basis functions, which resemble $k=3$ B-splines closely enough to maintain similar performance while delivering a significant speedup. It is a pragmatic engineering modification, though abandoning B-splines moves it even further from the Kolmogorov-Arnold representation theorem.
[1] Federico Girosi, Tomaso Poggio; Representation Properties of Networks: Kolmogorov’s Theorem Is Irrelevant. Neural Computation 1989; 1 (4): 465–469. https://doi.org/10.1162/neco.1989.1.4.465
[2] Liu, Z., Wang, Y., Vaidya, S., Ruehle, F., Halverson, J., Soljačić, M., … & Tegmark, M. (2024). KAN: Kolmogorov-Arnold Networks. arXiv preprint arXiv:2404.19756. https://arxiv.org/abs/2404.19756